Language models (LMs) have been shown to possess commonsense knowledge of the physical world, which is fundamental for completing tasks in everyday situations. However, whether LMs can generate grounded, executable plans for embodied tasks remains an open question. This is challenging because LMs lack an "eye" or "hand" with which to perceive a realistic environment. In this work, we present the first study of this important research question. We first introduce a novel problem formulation, G-PlanET, which takes as input a high-level goal and a table of objects in a specific environment; the expected output is a plan consisting of step-by-step instructions for an agent to execute. To enable the study of this problem, we establish an evaluation protocol and devise a dedicated metric for assessing the quality of plans. In our extensive experiments, we show that both encoding environments as flattened tables and using an iterative decoding strategy improve LMs' ability to plan in a grounded manner. Our analysis of the results also yields interesting, non-trivial findings.
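To make the two techniques highlighted above concrete, here is a minimal sketch (not the authors' released code) of (1) linearizing an environment's object table into a flat text sequence and (2) iterative, step-by-step plan decoding. The `generate` callable stands in for any seq2seq LM call; its name, the row-linearization format, and the "done" stopping convention are all assumptions for illustration.

```python
from typing import Callable, Dict, List

def flatten_table(objects: List[Dict[str, str]]) -> str:
    """Linearize an object table row by row, e.g.
    'row 1: object: apple | location: countertop | state: clean'.
    (The exact linearization format here is assumed, not prescribed.)"""
    rows = []
    for i, obj in enumerate(objects, start=1):
        cells = " | ".join(f"{k}: {v}" for k, v in obj.items())
        rows.append(f"row {i}: {cells}")
    return " ".join(rows)

def iterative_plan(goal: str,
                   objects: List[Dict[str, str]],
                   generate: Callable[[str], str],
                   max_steps: int = 10) -> List[str]:
    """Decode a plan one step at a time, feeding previously
    generated steps back into the prompt at each iteration."""
    env = flatten_table(objects)
    steps: List[str] = []
    for t in range(1, max_steps + 1):
        prompt = (f"Goal: {goal} Environment: {env} "
                  f"Previous steps: {' '.join(steps) or 'none'} "
                  f"Step {t}:")
        step = generate(prompt).strip()
        if not step or step.lower() == "done":  # assumed stop convention
            break
        steps.append(step)
    return steps
```

In this sketch, iterative decoding conditions each step on the flattened environment and all earlier steps, in contrast to emitting the whole plan in a single pass; any concrete LM wrapper with a string-to-string interface could be passed in as `generate`.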