
Generative text-to-image latent diffusion models (LDMs) have recently achieved significant advances, producing state-of-the-art image generation quality. Fine-tuning LDMs to align image outputs with human preferences is a key challenge for downstream applications. Traditionally, this fine-tuning relies on supervised learning with large datasets, which is impractical in scenarios with limited data. As an alternative, some on-policy policy gradient reinforcement learning (RL) algorithms have shown promise, but their applicability is limited by their need for explicit reward functions to score images during fine-tuning. To overcome the limitations of existing fine-tuning methods for LDMs, we propose Off-policy On-policy Optimization (O2O), a novel policy gradient RL algorithm. Unlike conventional RL methods that depend on explicit reward functions, O2O introduces a hybrid training strategy that combines generated images for on-policy learning with real images from datasets for off-policy learning. This approach enables effective alignment of LDMs with human preferences under limited supervision. To our knowledge, O2O is the first method to fine-tune LDMs using RL with a text-to-image dataset. Experimental results show that O2O consistently outperforms both supervised and RL-based fine-tuning in low-data scenarios, achieving superior image quality.
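The abstract does not spell out the O2O objective, so the following is only a minimal sketch of the hybrid data flow it describes: one gradient step that mixes on-policy samples produced by the current model with off-policy real samples drawn from a dataset. The toy network, the rollout, and both loss terms are assumptions purely for illustration, not the paper's actual update rule.

```python
# Illustrative sketch only: the exact O2O objective is not given in the abstract,
# so the model, the rollout, and the loss terms below are placeholders.
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Hypothetical stand-in for the LDM's denoising network."""
    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, z):
        return self.net(z)

policy = TinyDenoiser()
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

def on_policy_batch(model, n=8, dim=16):
    # On-policy stream: samples "generated" by the current model (toy rollout).
    z = torch.randn(n, dim)
    with torch.no_grad():
        return model(z)

def off_policy_batch(dataset, n=8):
    # Off-policy stream: real images (here, placeholder latents) from a dataset.
    idx = torch.randint(0, dataset.shape[0], (n,))
    return dataset[idx]

real_latents = torch.randn(128, 16)  # placeholder for encoded real images

for step in range(100):
    gen = on_policy_batch(policy)
    real = off_policy_batch(real_latents)

    # Placeholder surrogate losses: the abstract does not say how each branch
    # scores samples without an explicit reward model, so these terms only show
    # where the two data streams enter a single hybrid update.
    on_policy_loss = ((policy(gen) - gen) ** 2).mean()
    off_policy_loss = ((policy(real) - real) ** 2).mean()

    loss = on_policy_loss + off_policy_loss  # illustrative hybrid objective
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The point of the sketch is the batching structure: each step consumes both generated (on-policy) and real (off-policy) data, which is the hybrid strategy the abstract attributes to O2O.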
Nguyen, Hoa; Nguyen, Vinh-Tiep; Luong, Ngoc; Nguyen, Thanh-Son (2025-11-24). O2O: Fine-Tuning Diffusion Models with Reinforcement Learning Using a Hybrid of Generated and Real Images.
For further details & full text: https://www.researchgate.net/publication/397895867_O2O_Fine-Tuning_Diffusion_Models_with_Reinforcement_Learning_Using_a_Hybrid_of_Generated_and_Real_Images
