UPDATE: We now have a PyTorch implementation that supports LoRA for low-memory training here!
We train diffusion models directly on downstream objectives using reinforcement learning (RL). We do this by posing denoising diffusion as a multi-step decision-making problem, enabling a class of policy gradient algorithms that we call denoising diffusion policy optimization (DDPO). We use DDPO to finetune Stable Diffusion on objectives that are difficult to express via prompting, such as image compressibility, and on objectives derived from human feedback, such as aesthetic quality. We also show that DDPO can use feedback from a large vision-language model, LLaVA, to automatically improve prompt-image alignment.
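To make the multi-step decision-making framing concrete, here is a minimal PyTorch sketch of the idea, using a toy 2-D denoiser in place of Stable Diffusion and a simple REINFORCE-style (score-function) policy gradient. Everything here (`ToyDenoiser`, `reward_fn`, the fixed per-step noise `SIGMA`) is an illustrative assumption, not the released implementation: each denoising step is treated as an action, and the sum of per-step log-probabilities is weighted by the reward of the final sample.

```python
# Minimal sketch of the DDPO idea (REINFORCE variant) on a toy problem.
# All names below are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn

T = 10        # number of denoising steps
SIGMA = 0.1   # fixed std-dev of each denoising transition (assumption)

class ToyDenoiser(nn.Module):
    """Predicts the mean of the next (less noisy) sample from (x_t, t)."""
    def __init__(self, dim=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.ReLU(),
                                 nn.Linear(64, dim))

    def forward(self, x, t):
        t_embed = torch.full((x.shape[0], 1), t / T)
        return self.net(torch.cat([x, t_embed], dim=-1))

def reward_fn(x0):
    # Stand-in for a downstream reward (e.g. compressibility or an
    # aesthetic score); here, negative squared distance to (1, 1).
    return -((x0 - 1.0) ** 2).sum(dim=-1)

model = ToyDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    x = torch.randn(64, 2)           # start each trajectory from pure noise
    log_probs = []
    for t in reversed(range(T)):     # each denoising step is an "action"
        dist = torch.distributions.Normal(model(x, t), SIGMA)
        x = dist.sample()            # stochastic denoising transition
        log_probs.append(dist.log_prob(x).sum(dim=-1))
    reward = reward_fn(x)            # reward is only given at the end
    # Score-function gradient: per-step log-probs weighted by the
    # baseline-subtracted trajectory reward.
    advantage = reward - reward.mean()
    loss = -(torch.stack(log_probs).sum(dim=0) * advantage).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In the actual setting, the policy would be the pretrained diffusion model's denoising distribution and the reward would come from a compressibility score, an aesthetic-quality predictor, or LLaVA's prompt-image alignment feedback; the key point is that posing denoising as an MDP lets standard policy gradient machinery apply directly.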