Recent research has shown that Large Language Models (LLMs) can utilize external tools to improve their contextual processing abilities, moving beyond the pure language modeling paradigm and paving the way toward Artificial General Intelligence. However, there has been a lack of systematic evaluation demonstrating the efficacy of LLMs in using tools to respond to human instructions. This paper presents API-Bank, the first benchmark tailored for Tool-Augmented LLMs. API-Bank includes 53 commonly used API tools, a complete Tool-Augmented LLM workflow, and 264 annotated dialogues encompassing a total of 568 API calls. These resources are designed to thoroughly evaluate LLMs' ability to plan step-by-step API calls, retrieve relevant APIs, and correctly execute API calls to meet human needs. The experimental results show that GPT-3.5 exhibits an emergent ability to use tools relative to GPT-3, while GPT-4 demonstrates stronger planning performance. Nevertheless, there remains considerable room for improvement when compared to human performance. Additionally, detailed error analysis and case studies demonstrate the feasibility of Tool-Augmented LLMs for everyday use, as well as the primary challenges that future research needs to address.
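The three abilities the abstract names (planning step-by-step API calls, retrieving relevant APIs, and executing them) can be illustrated with a minimal sketch. All API names, descriptions, and the keyword-overlap retriever below are hypothetical stand-ins for illustration, not the benchmark's actual tools or method.

```python
from typing import Callable

# Hypothetical API registry: name -> (description, callable).
API_REGISTRY: dict[str, tuple[str, Callable[..., str]]] = {
    "GetWeather": ("Return current weather for a city",
                   lambda city: f"Sunny in {city}"),
    "AddAlarm":   ("Set an alarm at a given time",
                   lambda time: f"Alarm set for {time}"),
}

def retrieve_api(query: str) -> str:
    """Ability (2): pick the API whose description best matches the query.
    Naive keyword overlap stands in for a learned retriever."""
    def score(item: tuple[str, tuple[str, Callable[..., str]]]) -> int:
        _, (desc, _) = item
        return len(set(query.lower().split()) & set(desc.lower().split()))
    return max(API_REGISTRY.items(), key=score)[0]

def execute_api(name: str, *args: str) -> str:
    """Ability (3): run the chosen API with the planned arguments."""
    _, fn = API_REGISTRY[name]
    return fn(*args)

# Ability (1): a fixed "plan" of (query, args) steps an LLM might emit
# for a multi-step instruction.
plan = [("what is the weather", ("Berlin",)),
        ("set an alarm", ("07:00",))]

for query, args in plan:
    api = retrieve_api(query)
    print(f"{api} -> {execute_api(api, *args)}")
```

In the benchmark's actual workflow the plan and API arguments come from the LLM itself; the sketch only fixes the control flow that connects the three evaluated abilities.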