Reinforcement learning with verifiable rewards (RLVR) has become a practical route to improving large language model reasoning, and Group Relative Policy Optimization (GRPO) is a widely used optimizer in this setting. This paper revisits GRPO from a generalization perspective. Recent analysis shows that population performance can be controlled by a robust empirical objective that decomposes into the training loss plus a sharpness term measured by the gradient norm. We develop a token-level view of this sharpness term and show that GRPO can be dominated by a small subset of tokens with disproportionately large per-token gradients, which increases sharpness and can harm generalization. Motivated by this view, we propose Token-Regulated GRPO (TR-GRPO), which introduces a monotone probability shaping function that assigns token weights based on the model's own token probabilities and integrates these weights into the standard GRPO objective. Our analysis yields a bound that isolates a probability-dependent multiplicative factor in token-gradient magnitudes, explaining how probability-aware weighting suppresses sharp directions while preserving the learning signal on semantically critical tokens. Experiments on logic puzzles, mathematical reasoning, and tool-augmented question answering show consistent improvements over GRPO, along with smoother gradient-norm trajectories, supporting TR-GRPO as a simple and effective generalization-oriented upgrade to GRPO for RLVR.
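To make the weighting scheme concrete, the sketch below illustrates one plausible way to fold probability-aware token weights into a GRPO-style surrogate loss. It is a minimal PyTorch sketch under stated assumptions, not the paper's reference implementation: the shaping function `f(p) = p ** alpha` is a hypothetical placeholder for the monotone shaping function described in the abstract, the weights are assumed to be detached from the computation graph, and clipping/KL terms of the full GRPO objective are omitted.

```python
# Minimal sketch of probability-aware token weighting on a GRPO-style loss.
# Assumptions (not from the paper): f(p) = p ** alpha as the monotone shaping
# function, detached weights, and group-normalized advantages; clipping and
# KL regularization are omitted for brevity.

import torch


def tr_grpo_loss(logprobs, token_ids, advantages, mask, alpha=0.5):
    """Token-weighted GRPO-style surrogate loss.

    logprobs:   (B, T, V) log-probabilities from the current policy
    token_ids:  (B, T) sampled token ids
    advantages: (B,) group-relative advantages, one per sampled response
    mask:       (B, T) 1 for response tokens, 0 for prompt/padding
    alpha:      exponent of the hypothetical shaping function f(p) = p ** alpha
    """
    # Per-token log-probability of the sampled tokens.
    tok_logp = torch.gather(logprobs, -1, token_ids.unsqueeze(-1)).squeeze(-1)  # (B, T)

    # Monotone probability shaping: low-probability tokens receive smaller
    # weights, damping the rare tokens whose per-token gradients are
    # disproportionately large.
    with torch.no_grad():
        weights = tok_logp.exp() ** alpha  # f(p_t), detached

    # Broadcast the sequence-level advantage to its tokens, as in GRPO.
    adv = advantages.unsqueeze(-1)  # (B, 1)

    # Weighted policy-gradient surrogate, averaged over response tokens.
    per_token = -weights * adv * tok_logp * mask
    return per_token.sum() / mask.sum().clamp_min(1)
```

Because the weights enter multiplicatively and are detached, they rescale per-token gradient magnitudes without adding new gradient paths, which is the mechanism the abstract attributes to the sharpness reduction.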