Large reasoning models (LRMs) often consume excessive tokens, inflating computational cost and latency. We challenge the assumption that longer responses improve accuracy. By penalizing reasoning tokens through a discounted reinforcement learning setup (interpretable as a small per-token cost) and analyzing Blackwell optimality in restricted policy classes, we encourage concise yet accurate reasoning. Experiments confirm our theoretical results, showing that this approach shortens chains of thought while preserving accuracy.
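As a rough sketch of the token-cost interpretation (the symbols $\gamma$, $T$, and $r$ below are our own illustrative notation, not taken from the abstract): with a per-step discount $\gamma < 1$ applied to each reasoning token and a terminal correctness reward $r$, a chain of thought of length $T$ receives
\[
\gamma^{T} r \;=\; e^{T \ln \gamma}\, r \;\approx\; \bigl(1 - (1-\gamma)\,T\bigr)\, r \qquad \text{as } \gamma \to 1,
\]
i.e., approximately the undiscounted reward minus an effective cost of about $(1-\gamma)\,r$ per token, which is why a mild discount can be read as a small token cost.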