Aligning generative diffusion models with human preferences via reinforcement learning (RL) is critical yet challenging. Existing algorithms are often vulnerable to reward hacking, manifesting as quality degradation, over-stylization, or reduced diversity. Our analysis attributes this to an inherent limitation of their regularization, which provides unreliable penalties. We introduce Data-regularized Diffusion Reinforcement Learning (DDRL), a novel framework that uses the forward KL divergence to anchor the policy to an off-policy data distribution. Theoretically, DDRL enables robust, unbiased integration of RL with standard diffusion training. Empirically, this translates into a simple yet effective algorithm that combines reward maximization with diffusion loss minimization. With over a million GPU hours of experiments and ten thousand double-blind human evaluations, we demonstrate on high-resolution video generation tasks that DDRL significantly improves rewards while alleviating the reward hacking seen in baselines, achieving the highest human preference and establishing a robust and scalable paradigm for diffusion post-training.
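As a minimal sketch of the objective the abstract describes (the notation $p_\theta$, $p_{\mathrm{data}}$, $r$, and $\beta$ is ours, not taken from the paper), the data-regularized objective can be read as

\[
\max_{\theta}\; \mathbb{E}_{x \sim p_\theta}\big[r(x)\big] \;-\; \beta\, D_{\mathrm{KL}}\big(p_{\mathrm{data}} \,\big\|\, p_\theta\big),
\]

where the forward KL term anchors the policy $p_\theta$ to the off-policy data distribution $p_{\mathrm{data}}$. Since $D_{\mathrm{KL}}(p_{\mathrm{data}} \,\|\, p_\theta) = -\mathbb{E}_{x \sim p_{\mathrm{data}}}[\log p_\theta(x)] + \mathrm{const}$, and the standard denoising (diffusion) loss upper-bounds $-\log p_\theta(x)$, the regularizer can be reduced by ordinary diffusion training on data samples, which is consistent with the "reward maximization plus diffusion loss minimization" recipe stated above.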