This study focuses on using large language models (LLMs) as planners for embodied agents that can follow natural language instructions to complete complex tasks in a visually-perceived environment. The high data cost and poor sample efficiency of existing methods hinder the development of versatile agents that are capable of many tasks and can learn new tasks quickly. In this work, we propose a novel method, LLM-Planner, that harnesses the power of large language models to perform few-shot planning for embodied agents. We further propose a simple but effective way to enhance LLMs with physical grounding so that they generate and update plans grounded in the current environment. Experiments on the ALFRED dataset show that our method achieves very competitive few-shot performance: despite using less than 0.5% of paired training data, LLM-Planner performs competitively with recent baselines trained on the full training data, whereas existing methods can barely complete any task successfully under the same few-shot setting. Our work opens the door to developing versatile and sample-efficient embodied agents that can quickly learn many tasks. Website: https://dki-lab.github.io/LLM-Planner
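To make the described mechanism concrete, the following is a minimal sketch, not the paper's implementation, of how an LLM could be prompted few-shot to produce and later update a high-level plan grounded in the objects the agent has observed. The `complete` helper, the exemplar format, and the subgoal vocabulary are all illustrative assumptions.

```python
# Minimal sketch of few-shot, physically grounded high-level planning with an LLM.
# `complete` is a hypothetical placeholder for any text-completion LLM call;
# the in-context exemplars and output format are illustrative, not the paper's exact prompt.

from typing import List

IN_CONTEXT_EXAMPLES = [
    "Task: put a clean mug on the desk\n"
    "Plan: find mug, pick up mug, clean mug, find desk, put mug on desk",
    "Task: heat a potato and put it on the counter\n"
    "Plan: find potato, pick up potato, heat potato, find counter, put potato on counter",
]


def complete(prompt: str) -> str:
    """Placeholder for an LLM text-completion call (e.g., a GPT-3-style API)."""
    raise NotImplementedError


def plan(instruction: str, observed_objects: List[str],
         completed_subgoals: List[str]) -> List[str]:
    """Generate (or re-generate) a high-level plan conditioned on what the agent has seen."""
    prompt = "\n\n".join(IN_CONTEXT_EXAMPLES)
    prompt += f"\n\nTask: {instruction}\n"
    # Physical grounding: tell the LLM which objects are currently visible.
    prompt += f"Visible objects: {', '.join(observed_objects)}\n"
    if completed_subgoals:
        prompt += f"Completed subgoals: {', '.join(completed_subgoals)}\n"
    prompt += "Plan:"
    return [step.strip() for step in complete(prompt).split(",")]


# During execution, the agent can call plan() again with newly observed objects
# whenever it stalls or fails, yielding a dynamically updated, grounded plan.
```

The key design point this sketch illustrates is that grounding enters purely through the prompt: the planner needs no environment-specific training, so a handful of in-context examples suffices for the few-shot setting described above.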