Description: Domain-Agnostic Encoder for Fast Personalization of Text-to-Image Models
Tags: lora, text-to-image, textual inversion, personalized generation, dreambooth
Text-to-image (T2I) personalization allows users to guide the creative image generation process by combining their own visual concepts with natural language prompts. Recently, encoder-based techniques have emerged as an effective approach to T2I personalization, reducing the need for multiple images and long training times. However, most existing encoders are limited to a single-class domain, which hinders their ability to handle diverse concepts. In this work, we propose a domain-agnostic method that is not restricted to a single concept class, predicting both word embeddings and low-rank weight updates so that diverse subjects can be personalized quickly.
We pre-train an encoder to predict word embeddings and low-rank (LoRA) weight updates. Our method consists of a feature-extraction backbone that follows the E4T approach, using a mix of CLIP features from the concept image and denoiser-based features from the current noisy generation. These features are fed into an embedding-prediction head and a hypernetwork that predicts LoRA-style attention-weight offsets. During inference, we predict the LoRA weights and word embedding, then fine-tune them on the target subject.
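The two prediction heads can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function names, dimensions, and the way the hypernetwork factors are reshaped are all assumptions made for clarity. The key point it shows is that only two small low-rank factors are predicted per attention weight, and their product forms the additive offset.

```python
# Hypothetical sketch of the embedding head and LoRA hypernetwork head.
# All shapes and names are illustrative assumptions, not the paper's API.
import numpy as np

rng = np.random.default_rng(0)

def predict_word_embedding(features, head_weights):
    """Embedding head: map pooled backbone features to a soft token embedding."""
    return features @ head_weights

def predict_lora_offset(features, hyper_down, hyper_up, rank):
    """Hypernetwork head: predict low-rank factors A (d x r) and B (r x d);
    their product A @ B is the LoRA-style attention-weight offset."""
    d = hyper_down.shape[-1] // rank  # infer the weight dimension
    A = (features @ hyper_down).reshape(d, rank)
    B = (features @ hyper_up).reshape(rank, d)
    return A @ B  # (d, d) offset, rank at most r

feat_dim, d, r = 16, 8, 4
features = rng.normal(size=feat_dim)          # pooled CLIP/denoiser features
W_frozen = rng.normal(size=(d, d))            # a frozen attention weight

emb = predict_word_embedding(features, rng.normal(size=(feat_dim, d)))
delta = predict_lora_offset(features,
                            rng.normal(size=(feat_dim, d * r)),
                            rng.normal(size=(feat_dim, r * d)),
                            rank=r)
W_adapted = W_frozen + delta  # LoRA update: frozen weight plus low-rank offset
```

At inference, `emb` and the predicted factors would serve as the initialization that is then briefly tuned on the target subject.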
Avoiding subject overfitting via dual-path adaptation: We employ a dual-path adaptation approach in which each attention branch is run twice: once with the soft embedding and the hypernetwork offsets, and once with the vanilla model and a hard prompt containing the embedding's nearest-neighbor word. The two branches are linearly blended, so the vanilla path preserves the model's prior while the adapted path fits the target concept.
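The dual-path blend described above can be sketched for a single attention projection. This is a hedged NumPy toy, assuming a simple linear stand-in for the attention branch and an illustrative mixing coefficient `blend`; the real method applies this per attention layer inside the diffusion model.

```python
# Minimal sketch of dual-path adaptation for one attention projection.
# `attention_branch`, `blend`, and all shapes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)

def attention_branch(x, W):
    """Stand-in for one attention projection with weight W."""
    return x @ W

d = 8
x_soft = rng.normal(size=d)            # hidden states from the soft-embedding prompt
x_hard = rng.normal(size=d)            # hidden states from the hard nearest-neighbor prompt
W = rng.normal(size=(d, d))            # frozen pretrained weight
delta = 0.1 * rng.normal(size=(d, d))  # hypernetwork-predicted LoRA offset

blend = 0.75  # mixing coefficient; higher favors the adapted branch
adapted = attention_branch(x_soft, W + delta)  # personalized path
prior = attention_branch(x_hard, W)            # frozen-prior path
out = blend * adapted + (1.0 - blend) * prior  # linear blend keeps the prior in play
```

Because the frozen-prior path always contributes to the output, the adapted branch cannot drift arbitrarily far from the pretrained model's behavior, which is what limits subject overfitting.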