Language models (LMs) have demonstrated that they possess commonsense knowledge about the physical world, a crucial ingredient for performing everyday tasks. However, it remains unclear whether they can generate grounded, executable plans for embodied tasks. This is challenging because LMs cannot perceive the environment visually and receive no feedback from the physical world. In this paper, we address this important research question and present the first investigation into the topic. Our novel problem formulation, named G-PlanET, takes as input a high-level goal and a data table describing the objects in a specific environment, and outputs a step-by-step actionable plan for a robotic agent to follow. To facilitate the study, we establish an evaluation protocol and design a dedicated metric, KAS, to assess plan quality. Our experiments demonstrate that encoding the environment as a table and using an iterative decoding strategy can significantly enhance LMs' ability to plan in a grounded way. Our analysis also reveals interesting and non-trivial findings.
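To make the input/output formulation concrete, the following is a minimal illustrative sketch of how a goal and an environment object table might be linearized into a prompt for a text-only LM. The column names (`object_id`, `object_type`, `location`) and the separator format are hypothetical choices for illustration, not the paper's actual schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ObjectRow:
    # One row of the environment table (hypothetical schema):
    # an object instance, its type, and where it is located.
    object_id: str
    object_type: str
    location: str


def linearize_table(goal: str, rows: List[ObjectRow]) -> str:
    """Flatten the high-level goal and the object table into a single
    text prompt that a text-only LM can consume."""
    header = f"Goal: {goal}\nObjects:"
    lines = [f"- {r.object_id} | {r.object_type} | at {r.location}" for r in rows]
    return "\n".join([header, *lines])


rows = [
    ObjectRow("apple_1", "Apple", "CounterTop"),
    ObjectRow("knife_1", "Knife", "Drawer"),
]
prompt = linearize_table("slice an apple", rows)
print(prompt)
```

The model would then be expected to emit a step-by-step plan (e.g., open the drawer, pick up the knife, slice the apple), with each step referencing object identifiers from the table so the plan stays grounded in the given environment.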