Large language models (LLMs) have achieved remarkable progress on various natural language processing tasks, exhibiting emergent abilities. However, they face inherent limitations, such as an inability to access up-to-date information, utilize external tools, or perform precise mathematical reasoning. In this paper, we introduce Chameleon, a plug-and-play compositional reasoning framework that augments LLMs to help address these challenges. Chameleon synthesizes programs that compose various tools, including LLMs, off-the-shelf vision models, web search engines, Python functions, and rule-based modules tailored to user interests. Built on top of an LLM acting as a natural language planner, Chameleon infers the appropriate sequence of tools to compose and execute in order to generate a final response. We showcase the adaptability and effectiveness of Chameleon on two tasks: ScienceQA and TabMWP. Notably, Chameleon with GPT-4 achieves an 86.54% accuracy on ScienceQA, significantly improving upon the best published few-shot model by 11.37%; on TabMWP, it achieves a 17.8% improvement over the state-of-the-art model, reaching a 98.78% overall accuracy. Further studies suggest that, compared to other LLMs such as ChatGPT, using GPT-4 as the planner exhibits more consistent and rational tool selection and can infer potential constraints from the instructions.
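To make the plan-then-execute pattern described above concrete, here is a minimal Python sketch of an LLM planner that emits a tool sequence which is then run module by module over a shared state. All names here (`plan_with_llm`, `MODULES`, the individual module stubs) are hypothetical placeholders for illustration, not the paper's actual implementation or API.

```python
# A minimal sketch of the plan-then-execute pattern: a planner LLM proposes a
# sequence of tools, and each tool reads and updates a shared state in order.
# Every function below is a hypothetical stub, not the authors' real modules.

def image_captioner(state):
    # Placeholder: an off-the-shelf vision model would caption state["image"].
    state["caption"] = f"a caption for {state.get('image', '<no image>')}"
    return state

def web_search(state):
    # Placeholder: a search engine would retrieve passages for the question.
    state["passages"] = [f"retrieved text about: {state['question']}"]
    return state

def answer_generator(state):
    # Placeholder: an LLM would synthesize the final answer from the state.
    state["answer"] = f"answer derived from {list(state.keys())}"
    return state

# Registry of available tools the planner may compose.
MODULES = {
    "image_captioner": image_captioner,
    "web_search": web_search,
    "answer_generator": answer_generator,
}

def plan_with_llm(question):
    """Stand-in for the natural language planner: given the query (plus
    few-shot demonstrations in the real system), it emits a tool sequence."""
    return ["image_captioner", "web_search", "answer_generator"]

def run_chameleon(question, image=None):
    state = {"question": question, "image": image}
    program = plan_with_llm(question)   # planning step: infer the tool sequence
    for name in program:                # execution step: run modules in order
        state = MODULES[name](state)    # each module updates the shared state
    return state["answer"]

print(run_chameleon("Which force is acting on the sled?", image="sled.png"))
```

The key design choice this sketch illustrates is that the planner and the tools communicate only through an evolving state, so new modules can be plugged in by registering them and letting the planner mention them in its program.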