Description: Scaling Unified Multimodal LLMs with Mixture of Experts
The Mixture of Experts (MoE) architecture has emerged as a promising approach for scaling up Large Language Models (LLMs), enabling more efficient training and inference. Building on this potential, our work extends the MoE architecture to develop Uni-MoE, a unified Multimodal LLM designed to process a wide array of modalities, including audio, speech, images, text, and video. Specifically, our methodology enriches the transformer architecture of LLMs by incorporating multimodal experts, comprising: 1).
Figure 1: Architecture of Uni-MoE. By connecting the LLM with multimodal encoders, Uni-MoE demonstrates unified multimodal understanding capability. It mainly employs the MoE architecture to achieve stable and powerful performance on any multimodal input.
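To make the sparse-expert idea concrete, the following is a minimal sketch of an MoE feed-forward layer of the kind inserted into the LLM's transformer blocks, written in PyTorch. The class name, expert count, and top-k routing choice are illustrative assumptions, not the exact Uni-MoE implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Illustrative sparse MoE feed-forward sublayer with top-k token routing."""
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router scores every token against every expert.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Each expert is an independent feed-forward sub-network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) hidden states from the attention sublayer.
        gate_logits = self.router(x)                          # (B, T, num_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)   # route each token to its top-k experts
        weights = F.softmax(weights, dim=-1)                  # normalize gates over the selected experts

        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e                       # tokens assigned to expert e in slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out
```

Only the selected experts run on each token, which is what lets the model grow in parameters without a proportional increase in per-token compute.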
Cross-Modality Alignment. In the initial stage, we aim to establish connectivity between different modalities and the language domain. We achieve this by constructing connectors that translate various modal data into soft tokens within the language space. The primary objective at this stage is to minimize the generative entropy loss. As illustrated in the upper section of Figure 1, the LLM is optimized to generate descriptions for inputs across modalities, and only the connectors are subject to training. This approach ensures seamless alignment between each modality and the language space.
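As a rough illustration of this first stage, the sketch below freezes the LLM and a modality encoder and updates only the connector with the generative loss on paired captions. It assumes a HuggingFace-style causal LM interface; `ModalityConnector`, `encoder`, and the batch field names are hypothetical placeholders rather than the Uni-MoE codebase.

```python
import torch
import torch.nn as nn

class ModalityConnector(nn.Module):
    """Maps frozen encoder features into soft tokens in the LLM embedding space."""
    def __init__(self, enc_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(enc_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)  # (B, num_features, llm_dim)

def stage1_step(llm, connector, encoder, batch, optimizer):
    # Only the connector is trainable; the encoder and LLM stay frozen.
    encoder.requires_grad_(False)
    llm.requires_grad_(False)

    with torch.no_grad():
        feats = encoder(batch["modal_input"])        # e.g. audio or image features
    soft_tokens = connector(feats)                   # trainable projection into language space

    text_embeds = llm.get_input_embeddings()(batch["caption_ids"])
    inputs = torch.cat([soft_tokens, text_embeds], dim=1)

    # Generative loss on the caption tokens only; positions covered by
    # the soft tokens are masked out with -100.
    ignore = torch.full(soft_tokens.shape[:2], -100,
                        dtype=torch.long, device=inputs.device)
    labels = torch.cat([ignore, batch["caption_ids"]], dim=1)

    loss = llm(inputs_embeds=inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Because gradients flow only through the connector's projection, this stage is inexpensive relative to full fine-tuning and leaves the LLM's language ability untouched.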