Large language models (LLMs) with billions of parameters exhibit in-context learning abilities, enabling few-shot learning on tasks that the model was not specifically trained for. Traditional models achieve breakthrough performance on language tasks but perform poorly on basic reasoning benchmarks. However, a new in-context learning approach, chain-of-thought prompting, has demonstrated strong multi-step reasoning abilities on these benchmarks. Research on LLM reasoning abilities started with the question of whether LLMs can solve grade school math word problems, and has expanded to other tasks in the past few years. This article reviews the field of multi-step reasoning with LLMs. We propose a taxonomy that identifies different ways to generate, evaluate, and control multi-step reasoning. We provide in-depth coverage of core approaches and open problems, and we propose a research agenda for the near future. We find that multi-step reasoning approaches have progressed beyond math word problems and can now successfully solve challenges in logic, combinatorial games, and robotics, sometimes by first generating code that is then executed by external tools. Many studies of multi-step methods use reinforcement learning for finetuning, external optimization loops, in-context reinforcement learning, and self-reflection.
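To make the chain-of-thought approach mentioned above concrete, the following is a minimal sketch of how a few-shot chain-of-thought prompt for a grade-school math word problem might be assembled. The worked exemplar, the test question, and the `call_llm` stub are illustrative assumptions, not material from the survey or from any specific benchmark or API.

```python
# Minimal sketch: building a few-shot chain-of-thought prompt for a
# grade-school math word problem. The exemplar and question are made up
# for illustration; `call_llm` is a placeholder, not a real client.

COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of balls, each with 3 "
    "balls. How many tennis balls does he have now?\n"
    "A: Roger starts with 5 balls. 2 cans of 3 balls each is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n"
)

def build_cot_prompt(question: str) -> str:
    """Prepend a worked example so the model imitates step-by-step reasoning."""
    return f"{COT_EXEMPLAR}\nQ: {question}\nA:"

def call_llm(prompt: str) -> str:
    """Hypothetical LLM completion call; wire up a real model client here."""
    raise NotImplementedError("replace with an actual model or API call")

if __name__ == "__main__":
    question = ("A baker made 24 muffins and sold 3 boxes of 6 muffins. "
                "How many muffins are left?")
    prompt = build_cot_prompt(question)
    print(prompt)                 # inspect the few-shot chain-of-thought prompt
    # answer = call_llm(prompt)   # uncomment once a model client is available
```

The design choice here is the standard few-shot pattern: the exemplar's answer spells out intermediate steps, which encourages the model to produce its own intermediate reasoning before the final answer rather than guessing it directly.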