G$^2$RPO: Granular GRPO for Precise Reward in Flow Models

The integration of online reinforcement learning (RL) into diffusion and flow models has recently emerged as a promising approach for aligning generative models with human preferences. During denoising, stochastic sampling via Stochastic Differential Equations (SDEs) is used to generate diverse denoising directions for RL exploration. While existing methods effectively explore potential high-value samples, they suffer from sub-optimal preference alignment because their reward signals are sparse and narrow. To address these challenges, we propose a novel Granular-GRPO (G$^2$RPO) framework that achieves precise and comprehensive reward assessments of sampling directions in reinforcement learning of flow models. Specifically, a Singular Stochastic Sampling strategy is introduced to support step-wise stochastic exploration while enforcing a high correlation between the reward and the injected noise, thereby yielding a faithful reward for each SDE perturbation. Concurrently, to eliminate the bias inherent in fixed-granularity denoising, we introduce a Multi-Granularity Advantage Integration module that aggregates advantages computed at multiple diffusion scales, producing a more comprehensive and robust evaluation of the sampling directions. Experiments conducted on various reward models, including both in-domain and out-of-domain evaluations, demonstrate that our G$^2$RPO significantly outperforms existing flow-based GRPO baselines, highlighting its effectiveness and robustness.
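The idea of aggregating advantages computed at multiple granularities can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy reward values, and the choice of a plain average as the aggregation rule are all assumptions made for illustration, since the abstract does not specify them. The only part taken from standard GRPO practice is the group-relative advantage, which standardizes rewards within a group of samples.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style group-relative advantages: standardize rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps guards against division by zero when all rewards in the group are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


def integrate_granularities(rewards_by_granularity):
    """Combine per-granularity advantages for each sample.

    Assumption: a plain average across granularities. The abstract only says
    the advantages are aggregated, not how.
    """
    per_scale = [group_advantages(r) for r in rewards_by_granularity]
    # zip(*per_scale) regroups the values so each tuple belongs to one sample.
    return [mean(column) for column in zip(*per_scale)]


# Hypothetical rewards for a group of 4 samples, scored at 2 denoising granularities.
rewards_by_granularity = [
    [0.2, 0.5, 0.9, 0.4],
    [0.1, 0.6, 0.8, 0.5],
]
advantages = integrate_granularities(rewards_by_granularity)
print(advantages)
```

In this toy setup the third sample receives the largest integrated advantage because it scores highest at both granularities, while a sample that looks good at only one granularity would be pulled toward zero by the averaging.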

