x-decoder-vl.github.io - Towards a Generalized Multi-Modal Foundation Model



Example domain paragraphs

Figure 1. X-Decoder is a single model trained to support a wide range of vision and vision-language tasks.

We present X-Decoder, a generalized decoding pipeline that can predict pixel-level segmentation and language tokens seamlessly. X-Decoder takes as input two types of queries: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs, and decodes different pixel-level and token-level outputs in the same semantic space. With this novel design, X-Decoder is the first work to provide a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks.
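
The paragraph above describes a single decoder that mixes two query types and reads both pixel-level and token-level outputs from one shared semantic space. The sketch below is not the official X-Decoder implementation; it is a minimal PyTorch illustration of that idea, and the class name GeneralizedDecoderSketch, the stand-in embedding text encoder, the dimensions, and the two output heads are all assumptions made for illustration.

```python
# Minimal sketch (NOT the official X-Decoder code) of a generalized decoder that
# consumes (i) generic non-semantic latent queries and (ii) semantic queries
# derived from text, and produces pixel-level masks and token-level logits
# from the same set of decoder outputs. All names and sizes are assumptions.
import torch
import torch.nn as nn


class GeneralizedDecoderSketch(nn.Module):
    def __init__(self, dim=256, num_latent_queries=100, vocab_size=30522):
        super().__init__()
        # (i) generic non-semantic queries, learned and shared across tasks
        self.latent_queries = nn.Parameter(torch.randn(num_latent_queries, dim))
        # (ii) semantic queries come from text; a simple embedding stands in
        # for a real text encoder here
        self.text_embed = nn.Embedding(vocab_size, dim)
        # one transformer decoder attends from all queries to image features
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        # two heads reading from the same semantic space
        self.mask_head = nn.Linear(dim, dim)          # dotted with pixel features -> masks
        self.token_head = nn.Linear(dim, vocab_size)  # language / class logits

    def forward(self, pixel_features, text_tokens):
        # pixel_features: (B, H*W, dim) features from a vision backbone
        # text_tokens:    (B, T) token ids for the language input
        b = pixel_features.size(0)
        latent = self.latent_queries.unsqueeze(0).expand(b, -1, -1)
        semantic = self.text_embed(text_tokens)
        queries = torch.cat([latent, semantic], dim=1)           # joint query set
        decoded = self.decoder(tgt=queries, memory=pixel_features)
        n = latent.size(1)
        # pixel-level output: mask logits from the latent (non-semantic) queries
        masks = torch.einsum("bqd,bpd->bqp", self.mask_head(decoded[:, :n]), pixel_features)
        # token-level output: logits from the text-derived (semantic) queries
        tokens = self.token_head(decoded[:, n:])
        return masks, tokens


# toy usage with random inputs
model = GeneralizedDecoderSketch()
masks, tokens = model(torch.randn(2, 64 * 64, 256), torch.randint(0, 30522, (2, 12)))
print(masks.shape, tokens.shape)  # (2, 100, 4096), (2, 12, 30522)
```

In this toy setup the latent queries drive the mask predictions and the text-derived queries drive the token predictions, yet both pass through the same decoder and live in the same embedding space, which is the property the paragraph emphasizes.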

X-GPT: Connecting generalist X-Decoder with GPT-3

Instruct-X-Decoder: Object-centric instructional image editing

Model

Our X-Decoder is unique for three critical designs.
