llava-vl.github.io - LLaVA

Description: Visual Instruction Tuning

multimodal chatbot (10)

Example domain paragraphs

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks in the language domain, but the idea is less explored in the multimodal field. Multimodal Instruct Data. We present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. LLaVA Model. We introduce LLaVA (Large Language-and-Vision Assistant), an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding.
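To make the data-generation idea concrete, here is a minimal sketch (not the authors' pipeline) of how a language-only model can be asked to produce image-grounded instruction-following conversations from a purely symbolic description of an image (captions plus bounding boxes). The helper `query_language_model` is a hypothetical stand-in for whatever LLM API is used.

```python
# Sketch: serialize captions and object boxes into text so a language-only model
# can write multimodal instruction-following data without seeing pixels.
from typing import List, Tuple

def build_prompt(captions: List[str],
                 boxes: List[Tuple[str, Tuple[float, float, float, float]]]) -> str:
    """Turn an image's symbolic representation into a plain-text prompt."""
    caption_text = "\n".join(f"- {c}" for c in captions)
    box_text = "\n".join(f"- {name}: {coords}" for name, coords in boxes)
    return (
        "You are given a description of an image.\n"
        f"Captions:\n{caption_text}\n"
        f"Objects (name: x1, y1, x2, y2, normalized):\n{box_text}\n"
        "Write a conversation between a user asking about the image "
        "and an assistant answering as if it can see the image."
    )

def query_language_model(prompt: str) -> str:
    # Hypothetical LLM call, e.g. GPT-4 through an API client of your choice.
    raise NotImplementedError

if __name__ == "__main__":
    prompt = build_prompt(
        captions=["A man is grilling food at an outdoor barbecue."],
        boxes=[("person", (0.12, 0.05, 0.55, 0.98)),
               ("grill", (0.40, 0.50, 0.95, 0.99))],
    )
    print(prompt)  # pass to query_language_model(prompt) to obtain one training sample
```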

For each subset, we visualize the root noun-verb pairs for the instruction and response. For each chart, please click the link to the interactive page to check out the noun-verb pairs whose frequency is higher than the given number.
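As a rough illustration of how such root verb-noun pairs could be extracted and counted (this is illustrative, not the authors' script), the sketch below uses spaCy's dependency parse to take each instruction's root verb and its direct object, then keeps only pairs whose frequency exceeds a threshold. It assumes spaCy and the en_core_web_sm model are installed.

```python
# Sketch: count (root verb, direct object) pairs across a list of instructions.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def root_verb_noun(text: str):
    """Return (root verb lemma, direct-object lemma) for the first sentence, if any."""
    doc = nlp(text)
    for sent in doc.sents:
        root = sent.root
        if root.pos_ != "VERB":
            return None
        dobj = next((c for c in root.children if c.dep_ in ("dobj", "obj")), None)
        return (root.lemma_, dobj.lemma_) if dobj is not None else None
    return None

instructions = [
    "Describe the image in detail.",
    "Describe the image briefly.",
    "Explain the scene shown in the picture.",
]
pairs = Counter(p for p in map(root_verb_noun, instructions) if p is not None)
threshold = 1  # keep pairs whose frequency is higher than this number
print({pair: n for pair, n in pairs.items() if n > threshold})
```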

LLaVA connects the pre-trained CLIP ViT-L/14 visual encoder and the large language model Vicuna using a simple projection matrix. We consider a two-stage instruction-tuning procedure: Stage 1: Pre-training for Feature Alignment. Only the projection matrix is updated, based on a subset of CC3M. Stage 2: Fine-tuning End-to-End. Both the projection matrix and the LLM are updated for two different use scenarios: Visual Chat: LLaVA is fine-tuned on our generated multimodal instruction-following data for daily user-oriented applications. Science QA: LLaVA is fine-tuned on this multimodal reasoning dataset for the science domain.
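A minimal PyTorch sketch of the connector idea, under simplifying assumptions: a frozen vision encoder yields patch features, a single linear projection maps them into the LLM's token-embedding space, and the two training stages differ only in which parameters receive gradients. The dimensions and the toy Transformer standing in for the LLM are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn

class LlavaLikeConnector(nn.Module):
    """A 'simple projection matrix': one linear layer from visual to LLM token space."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 512):
        super().__init__()
        self.projection = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim), e.g. from CLIP ViT-L/14
        return self.projection(patch_features)  # (batch, num_patches, llm_dim)

def set_trainable(projection: nn.Module, llm: nn.Module, stage: int) -> None:
    """Stage 1: update only the projection. Stage 2: update projection and LLM."""
    for p in projection.parameters():
        p.requires_grad = True
    for p in llm.parameters():
        p.requires_grad = (stage == 2)

# Toy LLM stand-in just to show the freezing pattern (real model: Vicuna).
connector = LlavaLikeConnector(vision_dim=1024, llm_dim=512)
toy_llm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True), num_layers=1
)
set_trainable(connector.projection, toy_llm, stage=1)  # feature-alignment stage
visual_tokens = connector(torch.randn(2, 256, 1024))   # to prepend to text embeddings
print(visual_tokens.shape)  # torch.Size([2, 256, 512])
```

Switching `stage=1` to `stage=2` unfreezes the LLM for end-to-end fine-tuning while keeping the same projection layer in the loop.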
