Description: Making Large Multimodal Models Understand Arbitrary Visual Prompts
Figure 1: A multimodal model interacting with arbitrary visual prompts.
While existing large vision-language multimodal models focus on whole-image understanding, there is a prominent gap in achieving region-specific comprehension. Current approaches that use textual coordinates or spatial encodings often fail to provide a user-friendly interface for visual prompting. To address this challenge, we introduce a novel multimodal model capable of decoding arbitrary visual prompts. This allows users to intuitively mark images and interact with the model using natural cues like a "red bounding box" or "pointed arrow."
ViP-LLaVA directly overlays the visual prompt on the original image, then feeds the composited image to the multimodal model. Our approach has several benefits:

- Simple Design: no dedicated region-encoding module is needed.
- Generalizes to Arbitrary Visual Prompts: users can draw arbitrary visual prompts such as scribbles, circles, and points.

Please check out our [Model Zoo]. A sketch of the overlay step is shown below.
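To illustrate the overlay step, here is a minimal sketch assuming Pillow; `overlay_visual_prompt` and the file names are hypothetical and not from the ViP-LLaVA codebase.

```python
from PIL import Image, ImageDraw

def overlay_visual_prompt(image, bbox, color=(255, 0, 0, 160), width=4):
    """Draw a semi-transparent red ellipse over `bbox` (x0, y0, x1, y1).

    The composited result is an ordinary RGB image, so it can be fed to
    the multimodal model without any region-encoding module.
    """
    overlay = Image.new("RGBA", image.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(overlay)
    draw.ellipse(bbox, outline=color, width=width)
    return Image.alpha_composite(image.convert("RGBA"), overlay).convert("RGB")

# Hypothetical usage: mark a region, then pass `prompted` to the model
# alongside a question like "What is inside the red circle?"
image = Image.open("example.jpg")
prompted = overlay_visual_prompt(image, bbox=(40, 60, 200, 220))
prompted.save("example_with_prompt.jpg")
```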
During training, we use 8 diverse visual prompts: mask contour, ellipse, bounding box, triangle, scribble, point, arrow, and mask. Note that the prompts vary not only in shape but also in color, transparency, width, scale, and direction.
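To make this diversity concrete, here is a minimal sketch of how random prompt attributes could be sampled at training time, again assuming Pillow; the function name `sample_prompt_on`, the attribute ranges, and the three-shape subset are illustrative assumptions, not the repository's actual implementation.

```python
import random
from PIL import Image, ImageDraw

# Illustrative subset of the 8 training-time prompt shapes.
SHAPES = ["ellipse", "bounding box", "point"]
COLORS = [(255, 0, 0), (0, 255, 0), (0, 0, 255),
          (255, 255, 0), (255, 0, 255)]

def sample_prompt_on(image, bbox):
    """Overlay one randomly chosen prompt shape on `bbox`,
    with random color, transparency, and stroke width."""
    shape = random.choice(SHAPES)
    r, g, b = random.choice(COLORS)
    alpha = random.randint(96, 255)   # random transparency
    stroke = random.randint(2, 8)     # random stroke width
    overlay = Image.new("RGBA", image.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(overlay)
    x0, y0, x1, y1 = bbox
    if shape == "ellipse":
        draw.ellipse(bbox, outline=(r, g, b, alpha), width=stroke)
    elif shape == "bounding box":
        draw.rectangle(bbox, outline=(r, g, b, alpha), width=stroke)
    else:  # point: a small filled dot at the region center
        cx, cy = (x0 + x1) // 2, (y0 + y1) // 2
        rad = random.randint(4, 12)   # random scale
        draw.ellipse((cx - rad, cy - rad, cx + rad, cy + rad),
                     fill=(r, g, b, alpha))
    return Image.alpha_composite(image.convert("RGBA"), overlay).convert("RGB")
```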