Reinforcement Learning with Verifiable Rewards (RLVR) significantly enhances the reasoning capability of Large Language Models (LLMs). Current RLVR approaches typically train on all generated tokens, without exploring which tokens (e.g., prefix tokens) actually contribute to reasoning. This uniform training strategy spends substantial effort on optimizing low-return tokens, which in turn impedes the potential improvement from high-return tokens and reduces overall training effectiveness. To address this issue, we propose a novel RLVR approach called Progressive Prefix-token Policy Optimization (PPPO), which highlights the significance of the prefix segment of generated outputs. Specifically, inspired by Path Dependence, a well-established theory of human thinking in which early-stage thoughts substantially constrain the subsequent thinking trajectory, we identify an analogous phenomenon in LLM reasoning, termed the Beginning Lock-in Effect (BLE). PPPO leverages this finding by focusing its optimization objective on the prefix reasoning process of LLMs. This targeted optimization strategy positively influences subsequent reasoning and ultimately improves final results. To help LLMs learn more effectively how to start reasoning with high quality, PPPO introduces two training strategies: (a) Progressive Prefix Retention, which shapes a progressive learning process by increasing the proportion of retained prefix tokens during training; and (b) Continuation Accumulated Reward, which mitigates reward bias by sampling multiple continuations for each prefix token sequence and accumulating their scores as the reward signal. Extensive experiments on various reasoning tasks demonstrate that PPPO outperforms representative RLVR methods, achieving accuracy improvements of 18.02% while using only 26.17% of the training tokens.
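To make the two training strategies more concrete, the following is a minimal sketch of one plausible reading of them, based only on the description above. The linear retention schedule, the number of sampled continuations, and all names (retention_ratio, continuation_reward, sample, verify) are illustrative assumptions, not the paper's actual implementation.

```python
# Conceptual sketch of PPPO's two training strategies (assumptions, not the
# paper's implementation): Progressive Prefix Retention and
# Continuation Accumulated Reward.
import random


def retention_ratio(step: int, total_steps: int,
                    start: float = 0.1, end: float = 1.0) -> float:
    """Progressive Prefix Retention: the fraction of prefix tokens kept for
    training grows as training proceeds (a linear schedule is assumed here)."""
    progress = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * progress


def continuation_reward(prefix, sample_continuation, verify,
                        num_continuations: int = 4) -> float:
    """Continuation Accumulated Reward: sample several continuations of the
    same prefix and accumulate their verifiable scores into one reward."""
    scores = []
    for _ in range(num_continuations):
        continuation = sample_continuation(prefix)       # rollout from the prefix
        scores.append(verify(prefix + continuation))     # verifiable 0/1 score
    return sum(scores)                                   # accumulated reward signal


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    full_response = ["step1", "step2", "step3", "step4", "step5", "answer"]
    step, total_steps = 30, 100

    # Keep only a growing prefix of the sampled response for optimization.
    keep = max(1, int(retention_ratio(step, total_steps) * len(full_response)))
    prefix = full_response[:keep]

    sample = lambda p: ["..."]                            # dummy continuation sampler
    verify = lambda tokens: float(random.random() > 0.5)  # dummy verifier
    print(keep, continuation_reward(prefix, sample, verify))
```

In this reading, the retained prefix (rather than the full response) receives the policy-gradient update, and the accumulated score over multiple continuations serves as its reward, which matches the abstract's description of reducing reward bias from any single continuation.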