In this paper, we investigate the use of diffusion models pre-trained on large-scale image-caption pairs for open-vocabulary 3D semantic understanding. We propose a novel method, namely Diff2Scene, which leverages frozen representations from text-image generative models, along with salient-aware and geometric-aware masks, for open-vocabulary 3D semantic segmentation and visual grounding tasks. Diff2Scene gets rid of any labeled 3D data and effectively identifies objects, appearances, materials, locations, and their compositions in 3D scenes.
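As a rough illustration of the open-vocabulary classification step implied above (not the authors' implementation), the sketch below scores embeddings of predicted 3D masks against CLIP text embeddings of arbitrary class prompts; every function and variable name here is an assumption introduced for illustration.

```python
# Minimal sketch: score per-mask embeddings against text-prompt embeddings.
import torch
import torch.nn.functional as F

def open_vocab_scores(mask_embeddings: torch.Tensor,
                      text_embeddings: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """Score each predicted mask against every text prompt.

    mask_embeddings: (M, D) one embedding per predicted 3D mask
    text_embeddings: (C, D) one CLIP text embedding per class prompt
    returns:         (M, C) softmax class probabilities per mask
    """
    mask_embeddings = F.normalize(mask_embeddings, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)
    logits = mask_embeddings @ text_embeddings.t() / temperature
    return logits.softmax(dim=-1)

# Toy usage: 5 masks, 3 prompts (e.g. "desk", "soap dispenser", "whiteboard").
probs = open_vocab_scores(torch.randn(5, 512), torch.randn(3, 512))
assert probs.shape == (5, 3)
```

Because the class prompts are only consumed through their text embeddings, new or rare categories can be queried at test time without retraining, which is the property the abstract emphasizes.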
Illustration of open-vocabulary 3D semantic scene understanding. We propose Diff2Scene, a 3D model that performs open-vocabulary semantic segmentation and visual grounding given novel text prompts, without relying on any annotated 3D data. By leveraging discriminative and generative 2D foundation models, Diff2Scene can handle a wide variety of novel text queries for both common and rare classes, like "desk" and "soap dispenser". It can also handle compositional queries, such as "find the w...".
Illustration of open-vocabulary 3D perception methods. (a) Directly minimizing the per-point feature distance between the CLIP-based 2D model and the 3D model being tuned. (b) Using a 3D mask proposal network trained on labeled 3D data to produce class-agnostic masks, and then pooling the corresponding representations from the CLIP feature map. (c) The proposed mask-distillation approach, Diff2Scene, which uses Stable Diffusion and performs mask-based distillation. Diff2Scene leverages the semantically-rich ...
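The sketch below, again under our own assumptions rather than the paper's exact formulation, contrasts the two supervision signals the figure distinguishes: a per-point loss as in (a), which pulls each 3D point feature toward the 2D feature projected onto it, and a mask-level variant in the spirit of (b)/(c), which first pools features inside class-agnostic masks before aligning them. All shapes and names are hypothetical.

```python
# Hypothetical contrast of per-point vs. mask-pooled distillation losses.
import torch
import torch.nn.functional as F

def per_point_distillation(point_feats: torch.Tensor,
                           projected_2d_feats: torch.Tensor) -> torch.Tensor:
    """(a) Per-point loss: align each 3D point feature (N, D) with the 2D
    feature projected onto that point (N, D)."""
    return (1.0 - F.cosine_similarity(point_feats, projected_2d_feats, dim=-1)).mean()

def mask_pooled_distillation(point_feats: torch.Tensor,
                             projected_2d_feats: torch.Tensor,
                             masks: torch.Tensor) -> torch.Tensor:
    """Mask-level loss: pool features inside each class-agnostic mask
    (masks: (M, N) boolean) and align the pooled pairs."""
    losses = []
    for m in masks:                      # m: (N,) point membership of one mask
        if m.any():
            pooled_3d = point_feats[m].mean(dim=0)
            pooled_2d = projected_2d_feats[m].mean(dim=0)
            losses.append(1.0 - F.cosine_similarity(pooled_3d, pooled_2d, dim=0))
    return torch.stack(losses).mean()

# Toy usage: 1000 points, 512-dim features, 4 masks.
pts, img = torch.randn(1000, 512), torch.randn(1000, 512)
masks = torch.rand(4, 1000) > 0.7
print(per_point_distillation(pts, img), mask_pooled_distillation(pts, img, masks))
```

The mask-level signal averages out per-point noise inside each region before matching, which is one plausible reason to prefer mask-based distillation over raw per-point alignment; how Diff2Scene constructs its salient-aware and geometric-aware masks from Stable Diffusion features is described in the paper itself.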