x-decoder-vl.github.io - Towards a Generalized Multi-Modal Foundation Model

Description: Towards a Generalized Multi-Modal Foundation Model

Example domain paragraphs

Figure 1. X-Decoder is a single model trained to support a wide range of vision and vision-language tasks.

We present X-Decoder, a generalized decoding pipeline that can predict pixel-level segmentation and language tokens seamlessly. X-Decoder takes two types of queries as input: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs. It uses them to decode different pixel-level and token-level outputs in the same semantic space. With this design, X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) task
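The idea of two query types sharing one semantic space can be illustrated with a toy sketch. It is not the X-Decoder implementation: the function name `toy_decode`, the 2-dimensional vectors, and the dot-product scoring are all assumptions made for illustration, and the attention layers of a real decoder are omitted. It only shows that once non-semantic and text-derived queries live in the same space, each can be scored against per-pixel features (a pixel-level output) and against vocabulary embeddings (a token-level output).

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def toy_decode(queries, pixel_features, vocab_embeddings):
    """Toy stand-in for a decoder head (hypothetical, for illustration only).

    Every query, whatever its origin, yields:
      - a binary mask over pixels (pixel-level output)
      - the index of the best-matching vocabulary entry (token-level output)
    """
    masks = [
        [1 if dot(q, p) > 0 else 0 for p in pixel_features]
        for q in queries
    ]
    tokens = [
        max(range(len(vocab_embeddings)), key=lambda i: dot(q, vocab_embeddings[i]))
        for q in queries
    ]
    return masks, tokens


# (i) generic non-semantic queries and (ii) a query induced from text,
# all assumed to share the same 2-dimensional embedding space.
latent_queries = [[1.0, 0.0], [0.0, 1.0]]
text_queries = [[1.0, 1.0]]
queries = latent_queries + text_queries

# Four "pixels" and a three-entry "vocabulary", also in that space.
pixel_features = [[1.0, -1.0], [-1.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
vocab_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

masks, tokens = toy_decode(queries, pixel_features, vocab_embeddings)
print(masks)   # one mask per query, one entry per pixel
print(tokens)  # one vocabulary index per query
```

Because both outputs are computed from the same query vectors, the same head serves either query type; which output is actually used for a given task is a design decision this sketch leaves open.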

X-GPT: Connecting generalist X-Decoder with GPT-3
Instruct-X-Decoder: Object-centric instructional image editing
Model
Our X-Decoder is unique for three critical designs:

Links to x-decoder-vl.github.io (7)

chunyuan.li Chunyuan Li
jwyang.github.io Jianwei Yang’s Homepage
gligen.github.io GLIGEN: Open-Set Grounded Text-to-Image Generation.
maureenzou.github.io Xueyan's Homepage
harkiratbehl.github.io Harkirat Behl
zdou0830.github.io Zi-Yi Dou
multimodal-react.github.io MM-ReAct: Prompting ChatGPT for Multimodal Reasoning and Action