Planning is an important capability of artificial agents that perform long-horizon tasks in real-world environments. In this work, we explore the use of pre-trained language models (PLMs) to reason about plan sequences from text instructions in embodied visual environments. Prior PLM-based approaches to planning either assume observations are available as text (e.g., provided by a captioning model), reason about plans from the instruction alone, or incorporate information about the visual environment only in limited ways (e.g., through a pre-trained affordance function). In contrast, we show that PLMs can plan accurately even when observations are directly encoded as input prompts for the PLM. Experiments on the ALFWorld and VirtualHome benchmarks show that this simple approach outperforms prior methods.
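To make the "observations encoded directly as input prompts" idea concrete, here is a minimal sketch, not the paper's exact architecture: it assumes a frozen GPT-2 as a stand-in PLM and a hypothetical learned projection (obs_proj, with an assumed 512-dimensional visual feature) that maps observation features into the PLM's token-embedding space and prepends them to the embedded instruction. The names encode_prompt and the feature dimensionality are illustrative assumptions, not details from the paper.

```python
# Sketch: condition a frozen PLM on visual observations by projecting
# observation features into the token-embedding space and prepending them
# to the embedded text instruction (illustrative, not the paper's code).
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                      # stand-in PLM for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
plm = AutoModelForCausalLM.from_pretrained(model_name)
plm.eval()                               # keep the PLM frozen

embed_dim = plm.get_input_embeddings().embedding_dim
visual_dim = 512                         # assumed size of visual observation features

# Trainable projection from visual features into the PLM embedding space.
obs_proj = nn.Linear(visual_dim, embed_dim)

def encode_prompt(instruction: str, obs_feats: torch.Tensor) -> dict:
    """Prepend projected observation features to the embedded instruction."""
    tok = tokenizer(instruction, return_tensors="pt")
    text_emb = plm.get_input_embeddings()(tok["input_ids"])   # (1, T, D)
    obs_emb = obs_proj(obs_feats).unsqueeze(0)                # (1, N_obs, D)
    inputs_embeds = torch.cat([obs_emb, text_emb], dim=1)     # (1, N_obs + T, D)
    attn = torch.ones(inputs_embeds.shape[:2], dtype=torch.long)
    return {"inputs_embeds": inputs_embeds, "attention_mask": attn}

# Usage: dummy features for four observed objects plus a text instruction.
obs_feats = torch.randn(4, visual_dim)
inputs = encode_prompt("put a clean mug on the desk", obs_feats)
with torch.no_grad():
    out = plm(**inputs)
next_token_logits = out.logits[:, -1, :]  # could drive greedy decoding of plan tokens
```

In this setup only the projection would be trained, which is one common way to couple visual encoders with a frozen language model; whether the paper trains additional components is not specified here.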