Figure 1. GLIGEN enables versatile grounding capabilities for a frozen text-to-image generation model.
Large-scale text-to-image diffusion models have made amazing advances. However, the status quo is to use text input alone, which can impede controllability. In this work, we propose GLIGEN, Grounded-Language-to-Image Generation, a novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs. To preserve the vast concept knowledge of the pre-trained model, we freeze all of its weights and inject the grounding information into new trainable layers via a gated mechanism.
Figure 2. Gated Self-Attention is used to fuse new grounding tokens.
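As a rough illustration of this gated fusion, the sketch below implements a gated self-attention layer in PyTorch: visual tokens and grounding tokens are concatenated, self-attention runs over the joint sequence, only the visual-token outputs are kept, and they are added back through a tanh gate initialized at zero so the frozen model's behavior is unchanged at the start of training. The class name `GatedSelfAttention`, the layer sizes, and the use of `nn.MultiheadAttention` are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of gated self-attention for fusing grounding tokens.
# Names and hyperparameters are assumptions for illustration only.
import torch
import torch.nn as nn


class GatedSelfAttention(nn.Module):
    """Fuses grounding tokens into visual features via a gated residual."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate starts at zero, so at initialization the pre-trained (frozen)
        # model's output is unchanged; training learns how much grounding
        # information to inject.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, visual: torch.Tensor, grounding: torch.Tensor) -> torch.Tensor:
        # visual:    (B, L_v, dim) visual tokens from the frozen backbone
        # grounding: (B, L_g, dim) new grounding tokens (e.g., box + phrase embeddings)
        x = torch.cat([visual, grounding], dim=1)
        attn_out, _ = self.attn(x, x, x)                # self-attention over both token sets
        visual_update = attn_out[:, : visual.size(1)]   # keep only the visual-token outputs
        return visual + torch.tanh(self.gate) * visual_update


# Usage: fuse 4 grounding tokens into 64 visual tokens of width 320.
layer = GatedSelfAttention(dim=320)
v = torch.randn(2, 64, 320)
g = torch.randn(2, 4, 320)
out = layer(v, g)  # same shape as v: (2, 64, 320)
```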