Inference-time scaling has emerged as a powerful technique for enhancing the reasoning performance of Large Language Models (LLMs). However, existing approaches often rely on heuristic strategies for parallel sampling, lacking a principled foundation. To address this gap, we propose a probabilistic framework that formalizes the optimality of inference-time scaling under the assumption that parallel samples are independently and identically distributed (i.i.d.), and that the Best-of-$N$ selection strategy follows a probability distribution whose parameters can be estimated. Within this framework, we derive a theoretical lower bound on the number of samples required to achieve a target performance level, providing the first principled guidance for compute-efficient scaling. Building on this insight, we develop \textsc{OptScale}, a practical algorithm that dynamically determines the optimal number of sampled responses. \textsc{OptScale} employs a language-model-based predictor to estimate the probabilistic prior parameters, enabling it to determine the minimal number of samples that satisfies predefined performance thresholds and confidence levels. Extensive experiments on representative reasoning benchmarks (including MATH-500, GSM8K, AIME, and AMC) demonstrate that \textsc{OptScale} significantly reduces sampling overhead while matching or exceeding state-of-the-art reasoning performance. Our work offers both a theoretical foundation and a practical solution for principled inference-time scaling, addressing a critical gap in the efficient deployment of LLMs for complex reasoning.
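As an illustration of the kind of bound the i.i.d. assumption makes available (a minimal sketch under assumptions not stated in the abstract, not the paper's actual derivation): if each parallel sample is correct with probability $p$ and Best-of-$N$ selection is treated as an oracle, then the coverage of $N$ samples is $1-(1-p)^{N}$, and requiring coverage at least $1-\delta$ gives a lower bound on $N$:
\[
  1 - (1 - p)^{N} \;\ge\; 1 - \delta
  \quad\Longleftrightarrow\quad
  N \;\ge\; \left\lceil \frac{\ln \delta}{\ln (1 - p)} \right\rceil .
\]
Here $p$ and $\delta$ are hypothetical placeholders for a per-sample success probability and a tolerated failure level; in \textsc{OptScale}, the corresponding prior parameters are instead estimated by the language-model-based predictor.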