Description: BuboGPT
LLMs have demonstrated remarkable abilities at interacting with humans through language, especially with the usage of instruction-following data. Recent advancements in LLMs, such as MiniGPT-4, LLaVA, and X-LLM, further enlarge their abilities by incorporating multi-modal inputs, including image, video, and speech. Despite their effectiveness at generating precise and detailed language understanding of the given modality signal, these LLMs give up the ability to ground specific parts of the inputs, thus only constructing a coarse-grained mapping between the generated text and the input signals.
As shown in the figure, we perform joint multi-modal understanding and chatting for text, vision and audio, which is achieved by learning a shared representation space that aligns well with the pre-trained Vicuna LLM. We also build an off-the-shelf visual grounding pipeline to explore the fine-grained relations between different visual objects and modalities.

Figure: The framework of BuboGPT.
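To make the grounding idea concrete, here is a minimal sketch of the entity-matching step such a pipeline needs: entities mentioned in the LLM response are associated with labeled regions produced by an off-the-shelf detector. The function name, the token-overlap scoring, and the sample detections are illustrative assumptions, not the actual BuboGPT implementation.

```python
def match_entities(response_entities, detections, threshold=0.5):
    """Associate each mentioned entity with the best-matching detected region.

    response_entities: noun phrases extracted from the LLM output.
    detections: list of (label, box) pairs from an off-the-shelf detector.
    Returns {entity: box} for matches scoring at least `threshold`.
    """
    matches = {}
    for entity in response_entities:
        ent_tokens = set(entity.lower().split())
        best_score, best_box = 0.0, None
        for label, box in detections:
            lab_tokens = set(label.lower().split())
            # Token overlap, normalized by the smaller phrase length.
            overlap = len(ent_tokens & lab_tokens)
            score = overlap / min(len(ent_tokens), len(lab_tokens))
            if score > best_score:
                best_score, best_box = score, box
        if best_score >= threshold:
            matches[entity] = best_box
    return matches

detections = [("black dog", (12, 30, 180, 200)),
              ("red frisbee", (150, 40, 210, 90))]
print(match_entities(["a dog", "the frisbee"], detections))
# {'a dog': (12, 30, 180, 200), 'the frisbee': (150, 40, 210, 90)}
```

In practice this string-level matching would be replaced by the tagging and grounding models of the pipeline; the sketch only shows where the fine-grained text-to-region correspondence comes from.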
BuboGPT connects different modality Q-Formers with the pre-trained large language model Vicuna using a simple projection matrix. We consider a two-stage instruction-tuning procedure:

Stage 1: Single-modal Pre-training. We train the corresponding modality Q-Former and linear projection layer on a large number of modality-text paired data.

Stage 2: Multi-Modal Instruct Tuning. We curate a high-quality multi-modal instruction-following dataset to fine-tune only the linear projection layer:

Image-Text: We employ previously published image-text instruction-following datasets from MiniGPT-4 and LLaVA.
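The "simple projection matrix" above can be sketched as a per-modality linear layer that maps Q-Former query outputs into the LLM's token-embedding space. The dimensions, class name, and two-modality setup here are illustrative assumptions under a typical BLIP-2-style configuration, not the exact BuboGPT code.

```python
import torch
import torch.nn as nn

QFORMER_DIM, LLM_DIM, NUM_QUERIES = 768, 4096, 32  # assumed sizes

class ModalityProjector(nn.Module):
    """Linear map from Q-Former query features to LLM embedding space."""

    def __init__(self, in_dim=QFORMER_DIM, out_dim=LLM_DIM):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)  # the projection matrix

    def forward(self, qformer_out):
        # (batch, num_queries, in_dim) -> (batch, num_queries, out_dim);
        # the outputs are prepended to the text embeddings fed to Vicuna.
        return self.proj(qformer_out)

# One projector per modality; in Stage 2 only these layers are tuned,
# while the Q-Formers and the LLM stay frozen.
vision_proj = ModalityProjector()
audio_proj = ModalityProjector()

img_queries = torch.randn(2, NUM_QUERIES, QFORMER_DIM)
img_tokens = vision_proj(img_queries)
print(img_tokens.shape)  # torch.Size([2, 32, 4096])
```

Keeping only the projection trainable in Stage 2 is what lets a small instruction dataset align new modalities with the frozen language model.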