Frontier reasoning models have exhibited incredible capabilities across a wide array of disciplines, driven by post-training large language models (LLMs) with reinforcement learning (RL). However, despite the widespread success of this paradigm, much of the literature has been devoted to disentangling which behaviors that emerge during RL are truly novel rather than already present in the base models. In our work, we approach this question from a different angle, instead asking whether comparable reasoning capabilities can be elicited from base models at inference time by pure sampling, without any additional training. Inspired by Markov chain Monte Carlo (MCMC) techniques for sampling from sharpened distributions, we propose a simple iterative sampling algorithm leveraging the base models' own likelihoods. Across different base models, we show that our algorithm offers substantial boosts in reasoning that nearly match, and in some cases outperform, those from RL on a wide variety of single-shot tasks, including MATH500, HumanEval, and GPQA. Moreover, our sampler avoids the collapse in diversity over multiple samples that is characteristic of RL post-training. Crucially, our method does not require training, curated datasets, or a verifier, suggesting broad applicability beyond easily verifiable domains.
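The abstract does not spell out the algorithm, so the following is only a minimal illustrative sketch of the general idea it names: MCMC-style sampling that targets a sharpened distribution p(x)^alpha over completions using nothing but the base model's own likelihoods. The proposal here resamples a random suffix from the model and accepts it with a Metropolis ratio; the toy bigram "language model", the function names (`sample_suffix`, `suffix_logprob`, `sharpened_sample`), and the parameters `alpha` and `steps` are all hypothetical stand-ins, not the paper's actual implementation.

```python
# Illustrative sketch: Metropolis-Hastings targeting p(x)^alpha over completions,
# using only likelihoods from a (toy) base model. A real LLM would replace the
# bigram table; only sample_suffix() and suffix_logprob() would need to change.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8                                          # toy vocabulary size
P = rng.dirichlet(np.ones(VOCAB), size=VOCAB)      # toy bigram transition table

def sample_suffix(prev_token, length):
    """Sample a continuation and return (tokens, log p(tokens | prefix))."""
    tokens, logp = [], 0.0
    for _ in range(length):
        probs = P[prev_token]
        tok = rng.choice(VOCAB, p=probs)
        logp += np.log(probs[tok])
        tokens.append(tok)
        prev_token = tok
    return tokens, logp

def suffix_logprob(prev_token, tokens):
    """Log-likelihood of a given continuation under the toy base model."""
    logp = 0.0
    for tok in tokens:
        logp += np.log(P[prev_token][tok])
        prev_token = tok
    return logp

def sharpened_sample(prefix_token, length, alpha=4.0, steps=200):
    """Random-scan block MCMC over full continuations.

    Proposal: pick a cut point k and resample the suffix from the base model.
    For the target p(x)^alpha, the acceptance ratio for this move is
    (p(new suffix | prefix) / p(old suffix | prefix))^(alpha - 1).
    """
    x, logp_x = sample_suffix(prefix_token, length)
    for _ in range(steps):
        k = rng.integers(0, length)                # cut point in [0, length)
        prev = x[k - 1] if k > 0 else prefix_token
        new_tail, logp_new_tail = sample_suffix(prev, length - k)
        logp_old_tail = suffix_logprob(prev, x[k:])
        log_accept = (alpha - 1.0) * (logp_new_tail - logp_old_tail)
        if np.log(rng.random()) < log_accept:
            x = x[:k] + new_tail
            logp_x = logp_x - logp_old_tail + logp_new_tail
    return x, logp_x

if __name__ == "__main__":
    sample, logp = sharpened_sample(prefix_token=0, length=12, alpha=4.0)
    print("sampled tokens:", sample, "log-likelihood:", round(logp, 3))
```

Note the design choice this sketch relies on: because the proposal itself is drawn from the base model, all proposal terms except the sharpening exponent cancel in the acceptance ratio, so the procedure needs no training signal, reward model, or verifier, only forward-pass likelihoods.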