Description: InstantStyle.
Create images in any style!
Tuning-free diffusion-based models have demonstrated significant potential in the realm of image personalization and customization. However, despite this notable progress, current models continue to grapple with several complex challenges in producing style-consistent image generation. Firstly, the concept of 'style' is inherently underdetermined, encompassing a multitude of elements such as color, material, atmosphere, design, and structure, among others. Secondly, inversion-based methods are prone to style degradation, often resulting in the loss of fine-grained details.
Separating Content from Image. Instead of employing complex strategies to disentangle content and style from images, we take the simplest approach to achieving similar capabilities. Compared with the underdetermined attributes of style, content can usually be represented by natural text, so we can use CLIP's text encoder to extract the characteristics of the content text as the content representation. At the same time, we use CLIP's image encoder to extract the features of the reference image, as in previous works. Benefiting from the good characterization of CLIP's global features, subtracting the content text features from the image features explicitly decouples style from content, as sketched below.
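A minimal sketch of this feature subtraction, using the Hugging Face transformers CLIP wrappers; the checkpoint name, image path, and content prompt here are illustrative placeholders:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

reference = Image.open("reference.png")  # style reference image (placeholder path)
content_prompt = "a cat"                 # natural-text description of the content

inputs = processor(text=[content_prompt], images=reference,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    # Both encoders project into the shared CLIP embedding space,
    # so their outputs can be added or subtracted.
    image_feat = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_feat = model.get_text_features(input_ids=inputs["input_ids"],
                                        attention_mask=inputs["attention_mask"])

# Subtracting the content (text) features from the image features
# leaves a residual that serves as the style embedding.
style_feat = image_feat - text_feat
```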
Injecting into Style Blocks Only. Empirically, each layer of a deep network captures different semantic information. The key observation in our work is that there exist two specific attention layers handling style. Specifically, we find that up_blocks.0.attentions.1 and down_blocks.2.attentions.1 capture style (color, material, atmosphere) and spatial layout (structure, composition), respectively. We can use them to implicitly extract style information, further preventing content leakage without losing the strength of style.
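In recent versions of diffusers, this per-block injection can be expressed with an IP-Adapter scale dictionary; the sketch below assumes an SDXL pipeline with the public h94/IP-Adapter weights, and the prompt and image path are placeholders:

```python
import torch
from diffusers import StableDiffusionXLPipeline
from diffusers.utils import load_image

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models",
                     weight_name="ip-adapter_sdxl.bin")

# Scale 1.0 only for the second attention layer of up_blocks.0
# (i.e. up_blocks.0.attentions.1, the style block); every other layer
# stays at 0.0, so content from the reference image cannot leak in.
pipe.set_ip_adapter_scale({"up": {"block_0": [0.0, 1.0, 0.0]}})

style_image = load_image("style_reference.png")  # placeholder path
image = pipe(
    prompt="a cat, masterpiece, best quality",
    ip_adapter_image=style_image,
    guidance_scale=5.0,
    num_inference_steps=30,
).images[0]
```

Adding "down": {"block_2": [0.0, 1.0]} to the same dictionary would also enable down_blocks.2.attentions.1, transferring spatial layout alongside style.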