This study focuses on embodied agents that can follow natural language instructions to complete complex tasks in a visually-perceived environment. Existing methods rely on a large number of (instruction, gold trajectory) pairs to learn a good policy. The high data cost and poor sample efficiency prevent the development of versatile agents that are capable of many tasks and can learn new tasks quickly. In this work, we propose a novel method, LLM-Planner, that harnesses the power of large language models (LLMs) such as GPT-3 to do few-shot planning for embodied agents. We further propose a simple but effective way to enhance LLMs with physical grounding so that the generated plans are grounded in the current environment. Experiments on the ALFRED dataset show that our method achieves very competitive few-shot performance, even outperforming several recent baselines that are trained with the full training data, despite using less than 0.5% of the paired training data. Existing methods can barely complete any task successfully under the same few-shot setting. Our work opens the door to developing versatile and sample-efficient embodied agents that can quickly learn many tasks.
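To make the idea concrete, below is a minimal sketch of the kind of grounded few-shot prompting the abstract describes: a handful of in-context (instruction, high-level plan) exemplars are concatenated with the current instruction and a list of objects observed in the environment, and the LLM is asked to complete the plan. This is an illustrative assumption, not the authors' exact prompt template or implementation; `call_llm` is a hypothetical stand-in for whatever LLM backend (e.g. GPT-3) is used.

```python
# Hypothetical sketch of grounded few-shot planning with an LLM.
# The prompt format is illustrative, not the paper's exact template.

from typing import Callable, List, Tuple


def build_prompt(
    exemplars: List[Tuple[str, str]],   # few-shot (instruction, gold high-level plan) pairs
    instruction: str,                   # current natural language instruction
    observed_objects: List[str],        # objects perceived in the environment so far
) -> str:
    """Concatenate a few exemplars with the current, physically grounded query."""
    parts = ["Create a high-level plan for completing a household task."]
    for ex_instruction, ex_plan in exemplars:
        parts.append(f"Task: {ex_instruction}\nPlan: {ex_plan}")
    parts.append(
        f"Task: {instruction}\n"
        f"Visible objects: {', '.join(observed_objects)}\n"
        "Plan:"
    )
    return "\n\n".join(parts)


def plan(
    call_llm: Callable[[str], str],     # hypothetical interface: prompt -> completion text
    exemplars: List[Tuple[str, str]],
    instruction: str,
    observed_objects: List[str],
) -> List[str]:
    """Query the LLM once and split the completion into individual subgoals."""
    completion = call_llm(build_prompt(exemplars, instruction, observed_objects))
    return [step.strip() for step in completion.split(",") if step.strip()]
```

One plausible way to exploit the grounding, under the same assumptions, is to re-invoke `plan` with an updated `observed_objects` list as the agent explores, so that later plans reflect what has actually been perceived in the current environment.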