Rectifying LLM Thought from Lens of Optimization

Recent advancements in large language models (LLMs) have been driven by their emergent reasoning capabilities, particularly through long chain-of-thought (CoT) prompting, which enables thorough exploration and deliberation. Despite these advances, long-CoT LLMs often exhibit suboptimal reasoning behaviors, such as overthinking and excessively protracted reasoning chains, which can impair performance. In this paper, we analyze reasoning processes through an optimization lens, framing CoT as a gradient descent procedure where each reasoning step constitutes an update toward problem resolution. Building on this perspective, we introduce RePro (Rectifying Process-level Reward), a novel approach to refine LLM reasoning during post-training. RePro defines a surrogate objective function to assess the optimization process underlying CoT, utilizing a dual scoring mechanism to quantify its intensity and stability. These scores are aggregated into a composite process-level reward, seamlessly integrated into reinforcement learning with verifiable rewards (RLVR) pipelines to optimize LLMs. Extensive experiments across multiple reinforcement learning algorithms and diverse LLMs, evaluated on benchmarks spanning mathematics, science, and coding, demonstrate that RePro consistently enhances reasoning performance and mitigates suboptimal reasoning behaviors.

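The abstract describes a process-level reward built from two scores, intensity and stability, computed over a surrogate objective that tracks a chain of thought. It does not give the formulas, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes a hypothetical per-step surrogate objective value where lower is better, measures intensity as the relative net decrease, measures stability as the fraction of non-increasing steps, and combines them with illustrative weights. None of these definitions, names, or weights come from the paper.

```python
from typing import Sequence


def intensity_score(objective: Sequence[float]) -> float:
    """Assumed definition: net decrease of the objective relative to its start."""
    if len(objective) < 2 or objective[0] == 0:
        return 0.0
    return (objective[0] - objective[-1]) / abs(objective[0])


def stability_score(objective: Sequence[float]) -> float:
    """Assumed definition: fraction of steps that do not increase the objective."""
    steps = list(zip(objective, objective[1:]))
    if not steps:
        return 1.0
    return sum(1 for prev, cur in steps if cur <= prev) / len(steps)


def process_reward(
    objective: Sequence[float],
    w_intensity: float = 0.5,  # illustrative weight, not from the paper
    w_stability: float = 0.5,  # illustrative weight, not from the paper
) -> float:
    """Aggregate the two scores into one composite process-level reward."""
    return (
        w_intensity * intensity_score(objective)
        + w_stability * stability_score(objective)
    )


def total_reward(
    outcome_reward: float,
    objective: Sequence[float],
    process_weight: float = 0.1,  # illustrative weight, not from the paper
) -> float:
    """Add the process-level term to a verifiable outcome reward (RLVR-style)."""
    return outcome_reward + process_weight * process_reward(objective)


# Hypothetical objective traces, one value per reasoning step (lower is better).
steady = [4.0, 3.0, 2.0, 1.0]       # consistent progress toward a solution
oscillating = [4.0, 5.0, 3.0, 4.0]  # back-and-forth "overthinking" pattern
```

Under these assumed definitions, the steady trace receives a higher process reward than the oscillating one, which is the kind of signal the abstract suggests for discouraging overthinking while leaving the verifiable outcome reward in place.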