Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful framework for improving the reasoning abilities of Large Language Models (LLMs). However, current methods such as GRPO rely only on problems where the model's responses to the same input differ in correctness, while ignoring those where all responses receive the same reward -- so-called zero-variance prompts. In this work, we argue that such prompts are not useless but can, in fact, provide meaningful feedback for policy optimization. To this end, we introduce RL with Zero-Variance Prompts (RL-ZVP), a novel algorithm that extracts learning signals from zero-variance prompts. RL-ZVP directly rewards correctness and penalizes errors even without contrasting responses, modulating the feedback with token-level characteristics to preserve informative, nuanced signals. Across six math reasoning benchmarks, RL-ZVP achieves significant improvements of up to 8.61 points in accuracy and 7.77 points in pass rate over GRPO, while consistently outperforming other baselines that simply filter out zero-variance prompts. These results highlight the untapped potential of learning from zero-variance prompts in RLVR.
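To make the zero-variance issue concrete, the sketch below shows why GRPO's group-relative advantage vanishes when all responses to a prompt receive the same reward, and one hedged way such prompts could still yield a training signal. The functions `zvp_token_advantages`, the entropy-based modulation, and the `scale` parameter are illustrative assumptions standing in for the "token-level characteristics" mentioned above; they are not the paper's actual formulation.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Standard GRPO group-relative advantages.

    For a zero-variance prompt (every response gets the same reward), the
    numerator is zero for all responses, so the prompt contributes no gradient.
    """
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def zvp_token_advantages(rewards, token_entropies, scale=1.0):
    """Hypothetical sketch of a zero-variance-prompt signal (assumption, not the paper's method).

    If the group has reward variance, fall back to ordinary GRPO advantages.
    Otherwise assign a signed signal -- reward an all-correct group, penalize an
    all-wrong one -- modulated per token by a token-level statistic (here token
    entropy, used as an assumed proxy). Returns one per-token advantage array
    per response.
    """
    rewards = np.asarray(rewards, dtype=float)
    if rewards.std() > 0:  # informative prompt: broadcast GRPO advantages over tokens
        adv = grpo_advantages(rewards)
        return [np.full_like(np.asarray(h, dtype=float), a)
                for a, h in zip(adv, token_entropies)]
    sign = 1.0 if rewards[0] > 0 else -1.0  # all-correct -> reward, all-wrong -> penalize
    return [sign * scale * np.asarray(h, dtype=float) for h in token_entropies]

# Toy usage: a zero-variance group in which every sampled response is wrong.
rewards = [0.0, 0.0, 0.0]
token_entropies = [[0.2, 1.3, 0.5], [0.8, 0.1], [0.4, 0.4, 0.9, 0.2]]
print(grpo_advantages(rewards))                         # all zeros: GRPO skips this prompt
print(zvp_token_advantages(rewards, token_entropies))   # nonzero, entropy-modulated penalty
```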