Recently, a new wave of thinking-capable Large Language Models has emerged, demonstrating exceptional capabilities across a wide range of reasoning benchmarks. Early studies have begun to explore how the amount of compute devoted to the reasoning process, measured by its length and known as the thinking budget, affects model performance. In this work, we systematically investigate the thinking budget as a key parameter, examining its interaction with configurations such as self-consistency and reflection. Our goal is to provide an informative, balanced comparison framework that accounts for both performance outcomes and computational cost. Among our findings, simply increasing the thinking budget is not the most effective use of compute: more accurate responses can instead be obtained through alternative configurations, such as self-consistency and self-reflection.
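The self-consistency configuration mentioned above can be sketched as a simple majority vote over several independent reasoning passes. This is a minimal illustration under stated assumptions, not the paper's implementation: `sample_fn` is a hypothetical stand-in for one stochastic model call that returns a final answer string.

```python
from collections import Counter

def self_consistency(sample_fn, n_samples=5):
    """Draw n_samples answers and return the majority answer
    together with its agreement rate.

    sample_fn: hypothetical callable standing in for one
    stochastic reasoning pass of the model.
    """
    answers = [sample_fn() for _ in range(n_samples)]
    majority, count = Counter(answers).most_common(1)[0]
    return majority, count / n_samples

# Toy stand-in: a "model" whose five sampled chains agree 3/5 times.
responses = iter(["42", "41", "42", "42", "40"])
answer, agreement = self_consistency(lambda: next(responses), n_samples=5)
print(answer, agreement)  # 42 0.6
```

Under a fixed total compute budget, the cost of such a configuration scales with `n_samples` times the per-sample thinking budget, which is the trade-off the comparison framework above is meant to capture.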