Description: Multimodal Graph Benchmark.
Associating unstructured data with structured information is crucial for real-world tasks that require relevance search. However, existing graph learning benchmarks often overlook the rich semantic information inherent in each node. To bridge this gap, we introduce the Multimodal Graph Benchmark (MM-Graph), the first comprehensive multimodal graph benchmark that incorporates both textual and visual information. MM-Graph surpasses previous efforts, which have primarily focused on text-attributed graphs.
Visualization of our Multimodal Graph Benchmark. Every node in our benchmark has both visual and text features. (a) Amazon-Sports: The image and text come from the original image and title of the sports equipment. (b) Goodreads-LP: The image comes from the book cover. We omit the text features of Goodreads-LP because the book descriptions are long. (c) Ele-fashion: The image and text come from the original image and title of the fashion product.
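To make the node representation concrete, below is a minimal sketch of how a graph whose nodes carry both text and image features might be assembled with PyTorch Geometric. The feature dimensions, random placeholder tensors, and toy edge list are illustrative assumptions, not part of the MM-Graph release; in practice the embeddings would be loaded from precomputed text- and vision-encoder outputs.

```python
import torch
from torch_geometric.data import Data

num_nodes = 4
text_dim, image_dim = 768, 512  # hypothetical embedding sizes

# Placeholder features; real features would be loaded from the benchmark's
# precomputed per-node text and image embedding files.
text_feat = torch.randn(num_nodes, text_dim)
image_feat = torch.randn(num_nodes, image_dim)

# Concatenate the two modalities into a single node-feature matrix.
x = torch.cat([text_feat, image_feat], dim=-1)

# Toy edge list in COO format (source row, target row).
edge_index = torch.tensor([[0, 1, 2, 3],
                           [1, 2, 3, 0]], dtype=torch.long)

graph = Data(x=x, edge_index=edge_index)
print(graph)  # Data(x=[4, 1280], edge_index=[2, 4])
```

A model could also keep the two modalities as separate attributes (e.g., `data.text_x` and `data.image_x`) and fuse them inside the network; the concatenation above is just the simplest choice.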
We present the first method capable of localizing novel objects in 3D scenes using Neural Radiance Fields (NeRF) and Large Language Models (LLMs) through iterative, natural language-based interactions. While grounding dialog to 2D images in multimodal dialog systems has been studied extensively, little work has addressed 3D object grounding. Furthermore, existing 3D object grounding approaches predominantly operate in a rigid, phrase-based manner, in stark contrast to how humans naturally refer to objects.