Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach for improving the reasoning abilities of large language models (LLMs). The Group Relative Policy Optimization (GRPO) family has demonstrated strong performance in training LLMs with RLVR. However, as models train longer and scale larger, an increasing fraction of training prompts become residual prompts: prompts whose rewards have zero variance and therefore provide no training signal. Consequently, fewer prompts contribute to training, reducing diversity and hindering effectiveness. To fully exploit these residual prompts, we propose the Explore Residual Prompts in Policy Optimization (ERPO) framework, which encourages exploration on residual prompts and reactivates their training signals. ERPO maintains a history tracker for each prompt and adaptively increases the sampling temperature for residual prompts whose sampled responses were previously all correct. This encourages the model to generate more diverse reasoning traces, introducing incorrect responses that revive the training signal. Empirical results on the Qwen2.5 series demonstrate that ERPO consistently surpasses strong baselines across multiple mathematical reasoning benchmarks.
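To make the mechanism concrete, the sketch below illustrates one way the per-prompt history tracker and adaptive temperature adjustment described above could be realized. This is a minimal illustration under stated assumptions, not the paper's actual implementation: the class name `ResidualPromptTracker`, the constants `BASE_TEMPERATURE`, `TEMP_INCREMENT`, and `MAX_TEMPERATURE`, and the binary 0/1 reward convention are all hypothetical.

```python
# Minimal sketch (not the authors' code) of per-prompt history tracking and
# adaptive sampling temperature for residual prompts. All names and constants
# below are illustrative assumptions.

from collections import defaultdict

BASE_TEMPERATURE = 1.0   # assumed default sampling temperature
TEMP_INCREMENT = 0.2     # assumed step for raising temperature on residual prompts
MAX_TEMPERATURE = 1.6    # assumed cap to keep sampling stable


class ResidualPromptTracker:
    """Tracks, per prompt, how many consecutive rollouts produced all-correct
    responses (zero reward variance) and maps that streak to a temperature."""

    def __init__(self) -> None:
        self.all_correct_streak = defaultdict(int)

    def update(self, prompt_id: str, rewards: list[float]) -> None:
        # Assuming binary verifiable rewards (1.0 = correct, 0.0 = incorrect):
        # if every sampled response is correct, the group-relative advantage
        # is zero and the prompt contributes no gradient this step.
        if rewards and all(r == 1.0 for r in rewards):
            self.all_correct_streak[prompt_id] += 1
        else:
            self.all_correct_streak[prompt_id] = 0

    def temperature(self, prompt_id: str) -> float:
        # Raise the temperature with the streak length to encourage exploration,
        # which can reintroduce incorrect responses and revive the training signal.
        streak = self.all_correct_streak[prompt_id]
        return min(BASE_TEMPERATURE + streak * TEMP_INCREMENT, MAX_TEMPERATURE)
```

A usage example under the same assumptions: after observing an all-correct group of rollouts for a prompt, the next rollout for that prompt would be sampled at a higher temperature.

```python
tracker = ResidualPromptTracker()
tracker.update("prompt_42", [1.0, 1.0, 1.0, 1.0])  # all correct -> residual prompt
print(tracker.temperature("prompt_42"))            # 1.2 with the assumed constants
```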