robot-vila.github.io - ViLa

Description: Look Before You Leap: Unveiling the Power of GPT-4V in Robotic Vision-Language Planning

Tags: robotic planning, gpt-4v(ision)

Example domain paragraphs

In this study, we are interested in imbuing robots with the capability of physically-grounded task planning. Recent advancements have shown that large language models (LLMs) possess extensive knowledge useful in robotic tasks, especially in reasoning and planning. However, LLMs are constrained by their lack of world grounding and their dependence on external affordance models, which cannot jointly reason with LLMs, to perceive environmental information. We argue that a task planner should be an inherently grounded, unified multimodal system.

Given a language instruction and current visual observation, we leverage a VLM (GPT-4V) to comprehend the environment scene through chain-of-thought reasoning, subsequently generating a sequence of actionable steps. The first step of this plan is then executed by a primitive policy. Finally, the step that has been executed is added to the finished plan, enabling a closed-loop planning method in dynamic environments.
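To make the closed-loop structure concrete, the following is a minimal Python sketch of such a loop under stated assumptions. The names query_vlm, get_observation, and execute_primitive are hypothetical placeholders for the VLM query, the camera read, and the primitive policy described above; they are not taken from the ViLa codebase.

```python
"""Minimal closed-loop planning sketch in the spirit of ViLa.

All helper names below are hypothetical placeholders, not the authors' code.
"""

from typing import List


def query_vlm(instruction: str, image: bytes, finished_steps: List[str]) -> List[str]:
    """Hypothetical wrapper around a VLM such as GPT-4V.

    Prompts the model with the instruction, the current camera image, and the
    steps already finished, asking it to reason about the scene step by step
    (chain-of-thought) and return the remaining actionable steps.
    """
    raise NotImplementedError("replace with an actual VLM API call")


def get_observation() -> bytes:
    """Hypothetical camera read returning the current visual observation."""
    raise NotImplementedError


def execute_primitive(step: str) -> None:
    """Hypothetical primitive policy that executes a single plan step."""
    raise NotImplementedError


def closed_loop_plan(instruction: str, max_steps: int = 20) -> List[str]:
    finished: List[str] = []
    for _ in range(max_steps):
        image = get_observation()                    # re-perceive the (possibly changed) scene
        plan = query_vlm(instruction, image, finished)
        if not plan:                                 # VLM reports the task is complete
            break
        execute_primitive(plan[0])                   # only the first step of the plan is executed
        finished.append(plan[0])                     # executed step joins the finished plan
    return finished
```

The key design point is that the plan is regenerated from a fresh observation after every executed step, which is what lets the planner adapt to dynamic environments rather than committing to a fixed open-loop plan.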

ViLa excels in complex tasks that demand an understanding of spatial layouts or object attributes. This kind of commonsense knowledge pervades nearly every task of interest in robotics, but previous LLM-based planners consistently fall short in this regard.
