Description: Towards a Generalized Multi-Modal Foundation Model
Figure 1. X-Decoder is a single model trained to support a wide range of vision and vision-language tasks.
We present X-Decoder, a generalized decoding pipeline that can predict pixel-level segmentation and language tokens seamlessly. X-Decoder takes two types of queries as input: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs. It uses them to decode pixel-level and token-level outputs in the same semantic space. With this design, X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks.
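The two-query, two-output idea described above can be illustrated with a toy sketch. This is not the X-Decoder implementation: the single attention step, the residual update, the feature sizes, and every name (`decode`, `latent_queries`, `text_queries`, `pixel_out`, `token_out`) are assumptions chosen for illustration, and plain Python lists are used only to keep the example dependency-free.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns of a list-of-rows matrix."""
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def decode(image_feats, latent_queries, text_queries):
    """Toy one-step decoder (hypothetical, for illustration only).

    image_feats:    (num_pixels, dim) per-pixel features
    latent_queries: (n_latent, dim) generic non-semantic queries
    text_queries:   (n_text, dim) semantic queries derived from text
    Returns pixel-level logits and token-level embeddings per query.
    """
    # Both query types go through the same decoder as one sequence.
    queries = latent_queries + text_queries
    dim = len(image_feats[0])

    # Scaled dot-product cross-attention from queries to pixel features.
    scores = matmul(queries, transpose(image_feats))
    attn = softmax_rows([[s / math.sqrt(dim) for s in row] for row in scores])
    attended = matmul(attn, image_feats)

    # Residual update of each query with the attended image content.
    updated = [[q + a for q, a in zip(q_row, a_row)]
               for q_row, a_row in zip(queries, attended)]

    # Pixel-level output: one row of mask logits per query.
    pixel_out = matmul(updated, transpose(image_feats))
    # Token-level output: the updated query embeddings themselves.
    token_out = updated
    return pixel_out, token_out


def random_matrix(rows, cols, rng):
    """Gaussian random matrix as a list of rows (stand-in for real features)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
image_feats = random_matrix(16, 8, rng)    # e.g. a 4x4 feature map, dim 8
latent_queries = random_matrix(3, 8, rng)  # generic queries
text_queries = random_matrix(2, 8, rng)    # queries induced from text
pixel_out, token_out = decode(image_feats, latent_queries, text_queries)
print(len(pixel_out), len(pixel_out[0]))   # queries x pixels
print(len(token_out), len(token_out[0]))   # queries x dim
```

In this sketch each query yields both a row of mask logits over the pixels and an embedding of the same dimension as the text features, so the two kinds of output can be compared in one space. A real model would add learned projections, multiple decoder layers, and a text encoder, none of which are shown here.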
X-GPT: Connecting generalist X-Decoder with GPT-3
Instruct-X-Decoder: Object-centric instructional image editing
Model
Our X-Decoder is unique for three critical designs: