STIC: Enhancing LVLMs with Self-Training on Image Comprehension
Framework overview of STIC, a two-stage self-training algorithm focused on the image comprehension capability of LVLMs. In Stage 1, the base LVLM self-constructs its preference dataset for image description using well-designed prompts, poorly-designed prompts, and distorted images. In Stage 2, a small portion of the previously used supervised fine-tuning (SFT) data is recycled and infused with model-generated image descriptions to further fine-tune the base LVLM.
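To make Stage 1 concrete, below is a minimal Python sketch of the self-constructed preference data. The `lvlm.generate(image, prompt)` interface, the `corrupt` transform, and the prompt strings are hypothetical placeholders for illustration, not the authors' released code; the resulting pairs would then drive preference optimization (e.g., DPO) of the base model.

```python
# Illustrative sketch of Stage 1: self-constructing preference pairs
# for image description from unlabeled images.
import random
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class PreferencePair:
    image: Any          # unlabeled input image
    preferred: str      # description elicited by a well-designed, step-by-step prompt
    dispreferred: str   # description from a misleading prompt or a distorted image

def stage1_build_preferences(
    lvlm: Any,                       # assumed to expose .generate(image, prompt)
    images: List[Any],
    good_prompt: str,                # well-designed, step-by-step description prompt
    bad_prompts: List[str],          # misleading / poorly-designed prompts
    corrupt: Callable[[Any], Any],   # image distortion, e.g. color jitter or blur
) -> List[PreferencePair]:
    pairs: List[PreferencePair] = []
    for img in images:
        preferred = lvlm.generate(img, good_prompt)
        if random.random() < 0.5:
            # Dis-preferred response: misleading prompt on the clean image.
            dispreferred = lvlm.generate(img, random.choice(bad_prompts))
        else:
            # Dis-preferred response: good prompt on a corrupted image.
            dispreferred = lvlm.generate(corrupt(img), good_prompt)
        pairs.append(PreferencePair(img, preferred, dispreferred))
    return pairs
```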
Large vision language models (LVLMs) integrate large language models (LLMs) with pre-trained vision encoders, thereby activating the model's perception capability to understand image inputs for different queries and conduct subsequent reasoning. Improving this capability requires high-quality vision-language data, which is costly and labor-intensive to acquire. Self-training approaches have been effective in single-modal settings, alleviating the need for labeled data by leveraging the model's own generations. However, effective self-training remains a challenge for the unique visual perception and reasoning capabilities of LVLMs.
To address this, we introduce Self-Training on Image Comprehension (STIC), a self-training approach aimed specifically at image comprehension. First, the model self-constructs a preference dataset for image descriptions from unlabeled images: preferred responses are generated through a step-by-step prompt, while dis-preferred responses are generated from either corrupted images or misleading prompts. To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data and append its self-generated image descriptions to the prompts.
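Stage 2 can be sketched under the same hypothetical interface as above: a small recycled subset of the SFT data is infused with the model's own image descriptions before a further round of supervised fine-tuning. The prompt template and tuple layout here are illustrative assumptions, not the paper's exact format.

```python
# Illustrative sketch of Stage 2: description-infused fine-tuning data.
from typing import Any, List, Tuple

def stage2_infuse_descriptions(
    lvlm: Any,                                        # same assumed .generate interface
    sft_subset: List[Tuple[Any, str, str]],           # (image, instruction, answer) triples
    describe_prompt: str,                             # prompt asking for an image description
) -> List[Tuple[Any, str, str]]:
    infused: List[Tuple[Any, str, str]] = []
    for image, instruction, answer in sft_subset:
        # The model describes the image itself; the description is prepended
        # to the original instruction so that subsequent fine-tuning grounds
        # the model's reasoning in its own extracted visual information.
        description = lvlm.generate(image, describe_prompt)
        prompt = f"Image description: {description}\n{instruction}"
        infused.append((image, prompt, answer))
    return infused  # used for a further round of supervised fine-tuning
```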