Description: LLARVA introduces an instruction tuning method for robotic applications that unifies multiple learning tasks through structured prompts and uses 2-D visual traces to improve the alignment between vision and action.
In recent years, instruction-tuned Large Multimodal Models (LMMs) have been successful at several tasks, including image captioning and visual question answering; yet how to leverage these models for robotics remains an open question. Prior LMMs for robotics applications have been extensively trained on language and action data, but their ability to generalize across different settings has often fallen short. To address this, we introduce LLARVA, a model trained with a novel instruction tuning method that