This paper proposes a novel formulation of reinforcement learning (RL) with large language models, explaining why and under what conditions the true sequence-level reward can be optimized through a surrogate token-level objective in policy gradient methods such as REINFORCE. Specifically, through a first-order approximation, we show that this surrogate remains valid only when both the training-inference discrepancy and policy staleness are kept small. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training, including importance sampling correction, clipping, and, in particular, Routing Replay for Mixture-of-Experts (MoE) models. Through extensive experiments with a 30B MoE model, totaling hundreds of thousands of GPU hours, we show that for on-policy training, the basic policy gradient algorithm with importance sampling correction achieves the highest training stability. When off-policy updates are introduced to accelerate convergence, combining clipping with Routing Replay becomes essential to mitigate the instability caused by policy staleness. Notably, once training is stabilized, prolonged optimization consistently yields comparable final performance regardless of the cold-start initialization. We hope that the shared insights and the developed recipes for stable RL training will facilitate future research.
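For concreteness, one common way the pieces mentioned above fit together is the standard clipped, importance-weighted token-level surrogate used in REINFORCE/PPO-style training; the notation below (the clipping range \(\varepsilon\), the sequence-level advantage \(\hat{A}\), and the behavior policy \(\pi_{\text{old}}\)) is the standard formulation rather than this paper's exact objective, and should be read as a sketch:
\[
\mathcal{L}(\theta) \;=\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\text{old}}(\cdot \mid x)}\!\left[ \frac{1}{|y|} \sum_{t=1}^{|y|} \min\!\Big( r_t(\theta)\,\hat{A},\;\operatorname{clip}\big(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\,\hat{A} \Big) \right],
\qquad
r_t(\theta) \;=\; \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\text{old}}(y_t \mid x, y_{<t})},
\]
where \(\hat{A}\) is the advantage derived from the sequence-level reward, the per-token ratio \(r_t(\theta)\) provides the importance sampling correction for the discrepancy between the training policy and the policy that generated the rollouts, and the clipping term bounds the update when \(\pi_{\text{old}}\) becomes stale.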