This paper proposes a novel formulation for reinforcement learning (RL) with large language models, explaining why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. Specifically, through a first-order approximation, we show that this surrogate remains valid only when both the training-inference discrepancy and policy staleness are kept small. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training, including importance sampling correction, clipping, and particularly Routing Replay for Mixture-of-Experts (MoE) models. Through extensive experiments on a 30B MoE model, totaling hundreds of thousands of GPU hours, we show that for on-policy training, the basic policy gradient algorithm with importance sampling correction achieves the highest training stability. When off-policy updates are introduced to accelerate convergence, combining clipping and Routing Replay becomes essential to mitigate the instability caused by policy staleness. Notably, once training is stabilized, prolonged optimization consistently yields comparable final performance regardless of cold-start initialization. We hope that the shared insights and developed recipes for stable RL training will facilitate future research.
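For concreteness, a minimal sketch of the token-level surrogate referred to above is the standard importance-sampling-corrected and clipped policy-gradient objective written below; the notation (prompt $x$, response tokens $y_t$, behavior policy $\pi_{\theta_\mathrm{old}}$, advantage estimate $\hat{A}_t$, clip range $\varepsilon$) is generic and not necessarily the paper's exact formulation.
% A generic clipped, importance-sampling-corrected token-level surrogate;
% the paper's exact objective and notation may differ.
\begin{equation*}
  \mathcal{J}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta_\mathrm{old}}(\cdot \mid x)}
    \left[ \sum_{t=1}^{|y|}
      \min\!\Big( r_t(\theta)\,\hat{A}_t,\;
                  \operatorname{clip}\!\big(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\,\hat{A}_t \Big)
    \right],
  \qquad
  r_t(\theta) = \frac{\pi_\theta(y_t \mid x,\, y_{<t})}{\pi_{\theta_\mathrm{old}}(y_t \mid x,\, y_{<t})}.
\end{equation*}
In this form, on-policy updates keep $r_t(\theta)$ close to 1, so the importance-sampling correction alone suffices for stability, whereas off-policy staleness pushes $r_t(\theta)$ away from 1, which is where clipping (and, for MoE models, Routing Replay) becomes essential, consistent with the experimental findings summarized above.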