The integration of online reinforcement learning (RL) into diffusion and flow models has recently emerged as a promising approach for aligning generative models with human preferences. Stochastic sampling via stochastic differential equations (SDEs) is employed during the denoising process to generate diverse denoising directions for RL exploration. While existing methods effectively explore potential high-value samples, they suffer from sub-optimal preference alignment due to sparse and narrow reward signals. To address these challenges, we propose a novel Granular-GRPO (G$^2$RPO) framework that achieves precise and comprehensive reward assessments of sampling directions in the reinforcement learning of flow models. Specifically, a Singular Stochastic Sampling strategy is introduced to support step-wise stochastic exploration while enforcing a high correlation between the reward and the injected noise, thereby yielding a faithful reward for each SDE perturbation. Concurrently, to eliminate the bias inherent in fixed-granularity denoising, we introduce a Multi-Granularity Advantage Integration module that aggregates advantages computed at multiple diffusion scales, producing a more comprehensive and robust evaluation of the sampling directions. Experiments conducted on various reward models, including both in-domain and out-of-domain evaluations, demonstrate that our G$^2$RPO significantly outperforms existing flow-based GRPO baselines, highlighting its effectiveness and robustness.
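To make the advantage-aggregation idea concrete, the following is a minimal sketch (not the paper's implementation) of how GRPO-style group-relative advantages could be computed at several denoising granularities and then combined; the function names, array shapes, and uniform averaging are assumptions for illustration only.

```python
# Illustrative sketch only: hypothetical names/shapes, not the official G^2RPO code.
import numpy as np

def group_relative_advantage(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantage: normalize rewards within a group of sampling directions."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def multi_granularity_advantage(rewards_per_scale: list) -> np.ndarray:
    """
    Aggregate advantages computed at several denoising granularities.
    rewards_per_scale[k] holds one reward per group member, scored after
    denoising at granularity k (an assumption made for this sketch).
    """
    advantages = [group_relative_advantage(r) for r in rewards_per_scale]
    # Uniform averaging across scales; a weighted scheme would be a design choice.
    return np.mean(advantages, axis=0)

# Toy usage: a group of 4 sampling directions scored at 3 granularities.
rewards = [np.array([0.62, 0.55, 0.71, 0.49]),
           np.array([0.60, 0.58, 0.69, 0.47]),
           np.array([0.64, 0.52, 0.70, 0.50])]
print(multi_granularity_advantage(rewards))
```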