Test-time scaling (TTS) has proven effective in enhancing the reasoning capabilities of large language models (LLMs). Verification plays a key role in TTS: its quality affects reasoning performance, and its computational cost affects compute efficiency. In this work, we challenge the conventional paradigms of verification and make the first attempt to systematically investigate the impact of verification granularity, that is, how frequently the verifier is invoked during generation, beyond verifying only the final output or individual generation steps. To this end, we introduce Variable Granularity Search (VG-Search), a unified algorithm that generalizes beam search and Best-of-N sampling via a tunable granularity parameter g. Extensive experiments with VG-Search under varying compute budgets, generator-verifier configurations, and task attributes reveal that dynamically selecting g can improve compute efficiency and scaling behavior. Building on these findings, we propose adaptive VG-Search strategies that achieve accuracy gains of up to 3.1% over Beam Search and 3.6% over Best-of-N, while reducing FLOPs by over 52%. We will open-source the code to support future research.
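The abstract does not spell out the algorithm, so the following is only a minimal sketch of how a granularity parameter g could interpolate between step-level beam search and Best-of-N sampling. It assumes that the verifier is invoked after every g generation steps to prune and re-branch the candidates, and once more on the final outputs. The names `vg_search`, `extend`, `score`, `keep`, `make_toy_extend`, and `toy_score` are hypothetical, and the generator and verifier are toy stand-ins rather than real models.

```python
import random
from typing import Callable, List, Tuple

Path = List[int]  # hypothetical stand-in for a partial reasoning trace


def vg_search(
    extend: Callable[[Path], Path],
    score: Callable[[Path], float],
    n: int,
    keep: int,
    max_steps: int,
    g: int,
) -> Tuple[Path, int]:
    """Return (best path, number of verifier calls).

    Assumed behavior, not taken from the paper: the verifier (`score`) runs
    after every `g` generation steps, the top `keep` paths survive, and each
    survivor is copied so that `n` paths continue.
    """
    if g < 1 or keep < 1 or n % keep != 0:
        raise ValueError("need g >= 1 and n divisible by keep")
    branch = n // keep
    paths: List[Path] = [[] for _ in range(n)]
    verifier_calls = 0
    steps_done = 0
    while steps_done < max_steps:
        # Generate up to g steps without consulting the verifier.
        stride = min(g, max_steps - steps_done)
        for _ in range(stride):
            paths = [extend(p) for p in paths]
        steps_done += stride
        if steps_done < max_steps:
            # Intermediate verification: rank, prune, and re-branch.
            ranked = sorted(paths, key=score, reverse=True)
            verifier_calls += len(paths)
            paths = [list(p) for p in ranked[:keep] for _ in range(branch)]
    # Final verification selects the answer among the finished paths.
    verifier_calls += len(paths)
    return max(paths, key=score), verifier_calls


def make_toy_extend(seed: int) -> Callable[[Path], Path]:
    """Toy generator: appends a random bit as the next 'step'."""
    rng = random.Random(seed)

    def extend(path: Path) -> Path:
        return path + [rng.randint(0, 1)]

    return extend


def toy_score(path: Path) -> float:
    """Toy verifier: rewards paths containing more ones."""
    return float(sum(path))


if __name__ == "__main__":
    for g in (1, 2, 4):
        best, calls = vg_search(make_toy_extend(0), toy_score,
                                n=4, keep=2, max_steps=4, g=g)
        print(f"g={g}: verifier calls={calls}, best score={toy_score(best)}")
```

Under these assumptions, g = 1 verifies after every step, as in step-level beam search, while any g greater than or equal to `max_steps` scores only the finished candidates, which matches Best-of-N. With n = 4 and four steps, the sketch makes 16, 8, and 4 verifier calls for g = 1, 2, and 4, illustrating how a coarser granularity reduces verification cost.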