Large language models (LLMs) have recently demonstrated an impressive ability to perform arithmetic and symbolic reasoning tasks when provided with a few examples at test time (few-shot prompting). Much of this success can be attributed to prompting methods for reasoning, such as chain-of-thought (CoT), that employ LLMs both for understanding the problem description by decomposing it into steps and for solving each step of the problem. While LLMs seem to be adept at this sort of step-by-step decomposition, they often make logical and arithmetic mistakes in the solution step, even when the problem is decomposed correctly. We present Program-Aided Language models (PAL): a new method that uses the LLM to understand natural language problems and generate programs as the intermediate reasoning steps, but offloads the solution step to a programmatic runtime such as a Python interpreter. With PAL, decomposing the natural language problem into runnable steps remains the only learning task for the LLM, while solving is delegated to the interpreter. We experiment with 12 reasoning tasks from BIG-Bench Hard and other benchmarks, including mathematical reasoning, symbolic reasoning, and algorithmic problems. In all these natural language reasoning tasks, generating code with an LLM and reasoning with a Python interpreter leads to more accurate results than much larger models, and we set new state-of-the-art results on all 12 benchmarks. For example, PAL using Codex achieves state-of-the-art few-shot accuracy on the GSM benchmark of math word problems when the model is allowed only a single decoding, surpassing PaLM-540B with CoT prompting by an absolute 8%. In three reasoning tasks from the BIG-Bench Hard benchmark, PAL outperforms CoT by 11%. On GSM-hard, a more challenging version of GSM that we create, PAL outperforms chain-of-thought by an absolute 40%.
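To make the method concrete, the sketch below illustrates the kind of interaction PAL describes: the LLM is prompted to emit a small Python program whose comments carry the natural language reasoning steps, and a Python interpreter executes that program to produce the final answer. The word problem, the `solution()` wrapper, and the variable names are illustrative assumptions about the prompt format rather than the exact prompts used in the paper, and the LLM call is mocked out.

```python
# Minimal PAL-style sketch (illustrative, not the paper's exact prompt).
# An LLM would be asked to translate the word problem into a Python program;
# here the "generated" program is hard-coded to show the execution step.

QUESTION = (
    "Olivia has $23. She bought five bagels for $3 each. "
    "How much money does she have left?"
)

# What a PAL-style completion might look like: reasoning steps as comments
# and named intermediate variables, with the final value returned.
GENERATED_PROGRAM = """
def solution():
    # Olivia starts with $23
    money_initial = 23
    # She buys 5 bagels at $3 each
    bagels = 5
    bagel_cost = 3
    money_spent = bagels * bagel_cost
    # Money left is the initial amount minus what she spent
    money_left = money_initial - money_spent
    return money_left
"""

def run_generated_program(program: str):
    """Execute the model-generated program and return solution()'s result."""
    namespace: dict = {}
    exec(program, namespace)          # the interpreter does the "solving"
    return namespace["solution"]()

if __name__ == "__main__":
    print(QUESTION)
    print("Answer:", run_generated_program(GENERATED_PROGRAM))  # -> 8
```

The key design point this sketch reflects is the division of labor: the language model is only responsible for producing a faithful decomposition of the problem into code, while the arithmetic is carried out deterministically by the interpreter.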