Large language models (LLMs) can perform complex reasoning in few- and zero-shot settings by generating intermediate chain of thought (CoT) reasoning steps. Further, each reasoning step can rely on external tools to support computation beyond the core LLM capabilities (e.g. search/running code). Prior work on CoT prompting and tool use typically requires hand-crafting task-specific demonstrations and carefully scripted interleaving of model generations with tool use. We introduce Automatic Reasoning and Tool-use (ART), a framework that uses frozen LLMs to automatically generate intermediate reasoning steps as a program. Given a new task to solve, ART selects demonstrations of multi-step reasoning and tool use from a task library. At test time, ART seamlessly pauses generation whenever external tools are called, and integrates their output before resuming generation. ART achieves a substantial improvement over few-shot prompting and automatic CoT on unseen tasks in the BigBench and MMLU benchmarks, and matches performance of hand-crafted CoT prompts on a majority of these tasks. ART is also extensible, and makes it easy for humans to improve performance by correcting errors in task-specific programs or incorporating new tools, which we demonstrate by drastically improving performance on select tasks with minimal human intervention.
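The interleaving described above — pausing decoding at a tool call, executing the tool, and splicing its output back into the context before resuming — can be sketched as a simple loop. This is an illustrative sketch, not the paper's implementation: `generate_until`, the bracketed tool-call syntax, and the `calc` tool are all assumptions introduced here for demonstration.

```python
# Hedged sketch of ART-style interleaved generation and tool use.
# The tool-call syntax "[name(arg)]" and the tools below are illustrative,
# not the actual format or library from the ART paper.
import re

def run_tool(name, arg):
    # Minimal stand-in tool; ART's task library includes richer tools
    # such as search and code execution.
    tools = {"calc": lambda expr: str(eval(expr, {"__builtins__": {}}))}
    return tools[name](arg)

def art_style_decode(generate_until, prompt, max_rounds=10):
    """Alternate between frozen-LLM generation and tool execution.

    `generate_until(text, stop)` is an assumed LLM interface: it extends
    `text` and stops either at the `stop` marker (here, the "]" that
    closes a tool call) or when generation is complete.
    """
    text = prompt
    for _ in range(max_rounds):
        text = generate_until(text, stop="]")  # pause when a tool is called
        match = re.search(r"\[(\w+)\((.*?)\)\]$", text)
        if not match:  # no pending tool call: the program has finished
            return text
        name, arg = match.groups()
        # Integrate the tool's output, then resume generation next round.
        text += " -> " + run_tool(name, arg) + "\n"
    return text

# Usage with a scripted stand-in "model" (no real LLM involved):
def fake_llm(text, stop):
    if "[calc" not in text:
        return text + "[calc(2+3)]"   # model decides to call a tool
    return text + "Answer: 5"          # model resumes after seeing the result

result = art_style_decode(fake_llm, "Q: What is 2+3?\n")
```

The key design point is that the LLM itself stays frozen: all control flow lives in the decoding loop, which is why new tools can be added without retraining.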