Description: Arc2Face: A Foundation Model for ID-Consistent Human Faces
Tags: stable diffusion, arc2face, id-embeddings
TL;DR: We introduce a large dataset of high-resolution facial images with consistent ID and intra-class variability, and an ID-conditioned face model trained on it, which:
🔥 generates high-quality images of any subject given only its ArcFace embedding, within a few seconds
🔥 offers superior ID similarity compared to existing text-based models
🔥 is built on top of Stable Diffusion and can be extended to different input modalities, e.g. pose/expression
This paper presents Arc2Face, an identity-conditioned face foundation model which, given the ArcFace embedding of a person, can generate diverse photo-realistic images with a higher degree of face similarity than existing models. Despite previous attempts to decode face recognition features into detailed images, we find that common high-resolution datasets (e.g. FFHQ) lack sufficient identities to reconstruct any subject. To that end, we meticulously upsample a significant portion of the WebFace42M database, the largest public dataset for face recognition (FR).
We use a straightforward design to condition the pre-trained Stable Diffusion model on ID features. The ArcFace embedding is processed by the text encoder through a frozen pseudo-prompt for compatibility, allowing it to be projected into the CLIP latent space for cross-attention control. Both the text encoder and the UNet are optimized on a million-scale FR dataset (after upsampling), followed by additional fine-tuning on high-quality datasets, without any text annotations.
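To make the conditioning concrete, below is a minimal inference-time sketch of how an ArcFace embedding can be injected into a frozen pseudo-prompt and encoded by the CLIP text encoder. The placeholder token `<id>`, the projection layer `id_proj`, the exact pseudo-prompt wording, and the checkpoint id are illustrative assumptions, not the exact implementation:

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

# Load the tokenizer/text encoder of a Stable Diffusion checkpoint
# (model id assumed for illustration; Arc2Face ships its own fine-tuned weights).
repo = "stable-diffusion-v1-5/stable-diffusion-v1-5"
tokenizer = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")

# Register a placeholder token for the identity; its embedding row will be
# overwritten by the projected ArcFace feature (textual-inversion-style).
tokenizer.add_tokens(["<id>"])
text_encoder.resize_token_embeddings(len(tokenizer))
id_token_id = tokenizer.convert_tokens_to_ids("<id>")

# A 512-d ArcFace embedding (random here for illustration) projected to the
# CLIP token-embedding width (768 for SD 1.5). `id_proj` is a stand-in for
# the learned projection; in the real model it is trained, not random.
arcface_emb = torch.randn(1, 512)
id_proj = torch.nn.Linear(512, text_encoder.config.hidden_size)
with torch.no_grad():
    text_encoder.get_input_embeddings().weight[id_token_id] = id_proj(arcface_emb)[0]

# Frozen pseudo-prompt carrying the ID token; the encoder output conditions
# the UNet cross-attention exactly where text embeddings normally would.
tokens = tokenizer("photo of a <id> person", padding="max_length",
                   max_length=tokenizer.model_max_length, return_tensors="pt")
cond = text_encoder(tokens.input_ids).last_hidden_state  # (1, 77, 768)
# Downstream: unet(latents, timestep, encoder_hidden_states=cond)
```

In the actual model, the projection and text encoder are optimized jointly with the UNet, so the encoder learns to decode ID features rather than words.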