Description: SpatialVLM: a visual language model from Google DeepMind that natively understands and reasons about spatial relationships.
Tags: llm, multimodal, large language model, vlm, multi-modal, visual language model, spatial reasoning, distance estimation
Motivation: Humans effortlessly reason about spatial relationships, such as the positioning of objects relative to each other, and estimate distances and sizes. This natural proficiency in direct spatial reasoning contrasts with the current limitations of VLMs. Can we imbue VLMs with spatial reasoning abilities akin to those of humans?
Key insight: We hypothesize that the limited spatial reasoning ability of current VLMs is not due to a fundamental limitation of their architecture, but rather to a limitation of the common datasets available at scale on which such models are trained. We co-train a multi-modal large language model on synthetic spatial data to investigate this hypothesis.
We develop an automatic 3D spatial VQA data generation framework that lifts 2D images into metric-scale 3D point clouds. We scale the data pipeline up to 2 billion VQA examples on 10 million real-world images.
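To make the lifting step concrete, here is a minimal Python sketch of the general idea: unproject a (predicted) metric depth map into a 3D point cloud via the pinhole camera model, then template a distance question from two object masks. The function names, intrinsics, and QA template are illustrative assumptions, not the paper's actual pipeline, which builds on off-the-shelf depth, detection, and captioning models.

```python
# Minimal sketch of 2D-to-3D lifting + spatial QA generation.
# All inputs (depth map, camera intrinsics, object masks) stand in for the
# outputs of off-the-shelf perception models; names and values are assumptions.
import numpy as np


def unproject_to_point_cloud(depth, fx, fy, cx, cy):
    """Lift a metric depth map (H, W) into a 3D point cloud (H, W, 3)
    using the pinhole camera model."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)


def object_centroid(points, mask):
    """Average 3D position of the points covered by a 2D object mask."""
    return points[mask].mean(axis=0)


def make_distance_qa(points, name_a, mask_a, name_b, mask_b):
    """Template one spatial VQA pair about the distance between two objects."""
    dist = np.linalg.norm(object_centroid(points, mask_a) -
                          object_centroid(points, mask_b))
    question = f"How far is the {name_a} from the {name_b}?"
    answer = f"The {name_a} is roughly {dist:.1f} meters from the {name_b}."
    return question, answer


# Toy example with synthetic inputs; a real pipeline would use predicted
# metric depth, estimated intrinsics, and detected/segmented objects.
depth = np.full((480, 640), 2.0)  # flat 2 m depth, purely for illustration
points = unproject_to_point_cloud(depth, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
mask_a = np.zeros((480, 640), dtype=bool); mask_a[200:240, 100:140] = True
mask_b = np.zeros((480, 640), dtype=bool); mask_b[200:240, 500:540] = True
print(make_distance_qa(points, "chair", mask_a, "table", mask_b))
```

Scaling this template-based generation over millions of images is what yields the billions of synthetic spatial QA pairs used for co-training.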