This paper examines the exploration-exploitation trade-off in reinforcement learning with verifiable rewards (RLVR), a framework for improving the reasoning of Large Language Models (LLMs). Recent studies suggest that RLVR can elicit strong mathematical reasoning through two seemingly paradoxical mechanisms: spurious rewards, which suppress exploitation by rewarding outcomes unrelated to the ground truth, and entropy minimization, which suppresses exploration by pushing the model toward more confident and deterministic outputs. This highlights a puzzling dynamic: both discouraging exploitation and discouraging exploration improve reasoning performance, yet the principles that reconcile these effects remain poorly understood. We focus on two fundamental questions: (i) how policy entropy relates to performance, and (ii) whether spurious rewards yield gains, potentially through the interplay of clipping bias and model contamination. Our results show that clipping bias under spurious rewards reduces policy entropy, yielding more confident and deterministic outputs, while entropy minimization alone is insufficient for improvement. We further propose a reward-misalignment model that explains why spurious rewards can enhance performance beyond contaminated settings. Our findings clarify the mechanisms behind spurious-reward benefits and provide principles for more effective RLVR training.
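For reference, the clipping bias and policy entropy discussed above can be fixed in notation with a minimal sketch, assuming the conventional PPO-style clipped surrogate objective (the paper's exact training objective may differ):
$$
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}, \qquad
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\big)\right],
$$
$$
\mathcal{H}(\pi_\theta) = -\,\mathbb{E}_{s}\!\left[\sum_{a} \pi_\theta(a \mid s)\,\log \pi_\theta(a \mid s)\right].
$$
Here $\hat{A}_t$ is the advantage estimate and $\epsilon$ the clipping range; in this notation, the abstract's claim is that under spurious rewards the asymmetry introduced by the clip term biases updates in a way that lowers $\mathcal{H}(\pi_\theta)$, i.e., makes outputs more confident and deterministic.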