Large language models (LLMs) solve reasoning problems by first generating a rationale and then answering. We formalize reasoning as a latent variable model and derive an expectation-maximization (EM) objective for learning to reason. This view connects EM to modern reward-based optimization and shows that the main challenge lies in designing a sampling distribution that generates rationales justifying correct answers. We instantiate and compare several sampling schemes: rejection sampling with a budget, the self-taught reasoner (STaR), and prompt posterior sampling (PPS), which keeps only the rationalization stage of STaR. Our experiments on the ARC, MMLU, and OpenBookQA datasets with the Llama and Qwen models show that the sampling scheme can significantly affect the accuracy of learned reasoning models. We observe that, despite its simplicity, PPS outperforms the other sampling schemes.
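To make the latent-variable view concrete, here is a minimal sketch of the EM objective; the notation (question $x$, latent rationale $z$, answer $y$, parameters $\theta$, sampling distribution $q$) is introduced for illustration and is not taken verbatim from the paper.

```latex
% Latent-variable view: question x, latent rationale z, answer y, parameters \theta.
% Answer likelihood with the rationale marginalized out:
%   \log p_\theta(y \mid x) = \log \sum_z p_\theta(z \mid x)\, p_\theta(y \mid x, z)
% EM maximizes the evidence lower bound induced by a sampling distribution q over rationales:
\[
  \log p_\theta(y \mid x) \;\ge\;
  \mathbb{E}_{z \sim q(\cdot \mid x, y)}
  \bigl[ \log p_\theta(z \mid x) + \log p_\theta(y \mid x, z) \bigr]
  + \mathrm{H}(q)
\]
% E-step: choose q close to the posterior p_\theta(z \mid x, y), i.e. sample rationales
%         that justify the correct answer.
% M-step: maximize the expected complete-data log-likelihood over \theta, i.e. fine-tune
%         on the sampled rationale-answer pairs (a reward-weighted update).
```

On this reading, the sampling schemes compared in the paper (rejection sampling with a budget, STaR, and PPS) correspond to different choices of $q$, i.e. different ways of drawing rationales that justify correct answers.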