Reinforcement Learning with Verifiable Rewards (RLVR) has propelled Large Language Models forward in complex reasoning, yet its scalability is often hindered by a training bottleneck: performance plateaus as policy entropy collapses, signaling a loss of exploration. Previous methods typically address this by maintaining high policy entropy, yet the precise mechanisms that govern meaningful exploration have remained underexplored. Our analysis suggests that an unselective focus on entropy risks amplifying irrelevant tokens and destabilizing training. This paper investigates the exploration dynamics within RLVR and identifies a key issue: the gradual elimination of valuable low-probability exploratory tokens, which we term \textbf{\textit{reasoning sparks}}. We find that, while abundant in pre-trained models, these sparks are systematically extinguished during RLVR due to over-penalization, leading to a degeneracy in exploration. To address this, we introduce Low-probability Regularization (Lp-Reg). Its core mechanism regularizes the policy towards a heuristic proxy distribution, constructed by filtering out presumed noise tokens and re-normalizing the distribution over the remaining candidates. The result is a less noisy proxy in which the probability of \textit{reasoning sparks} is amplified; this proxy then serves as a soft regularization target that shields these valuable tokens from elimination via a KL-divergence term. Experiments show that Lp-Reg enables stable on-policy RL, sustaining continuous scaling across $3,000$ training steps and $81,204$ GPU-hours, where baseline entropy-control methods collapse. This sustained exploration leads to state-of-the-art performance, achieving a $60.17\%$ average accuracy on five math benchmarks, an improvement of $2.66\%$ over prior methods. Code is available at https://github.com/CarlanLark/Lp-Reg.
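To make the construction concrete, the following is a minimal formal sketch in our own notation (the noise threshold $\epsilon$, the KL direction, and the coefficient $\beta$ are illustrative assumptions rather than the paper's exact formulation). Given the policy $\pi_\theta(\cdot \mid s)$ over a vocabulary $\mathcal{V}$, the proxy keeps only tokens whose probability exceeds $\epsilon$, re-normalizes over the survivors, and the regularizer penalizes the divergence between the proxy and the policy:
\[
\tilde{\pi}(v \mid s) = \frac{\pi_\theta(v \mid s)\,\mathbb{1}\!\left[\pi_\theta(v \mid s) \ge \epsilon\right]}{\sum_{v' \in \mathcal{V}} \pi_\theta(v' \mid s)\,\mathbb{1}\!\left[\pi_\theta(v' \mid s) \ge \epsilon\right]},
\qquad
\mathcal{L}_{\mathrm{reg}} = \beta\, D_{\mathrm{KL}}\!\left(\tilde{\pi}(\cdot \mid s) \,\middle\|\, \pi_\theta(\cdot \mid s)\right),
\]
with $\mathcal{L}_{\mathrm{reg}}$ added to the standard RLVR objective. Re-normalization amplifies the probability of the surviving low-probability tokens (the \textit{reasoning sparks}), and the KL term softly pulls the policy toward the proxy, preventing these tokens from being driven to zero.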