Recent methods for embodied instruction following are typically trained end-to-end using imitation learning, which requires both expert trajectories and low-level language instructions. Such approaches assume that learned hidden states will simultaneously integrate semantics from language and vision to perform state tracking, spatial memory, exploration, and long-term planning. In contrast, we propose a modular method with structured representations that (1) builds a semantic map of the scene and (2) performs exploration with a semantic search policy to achieve the natural language goal. Our modular method achieves state-of-the-art performance (24.46%), a substantial gap (8.17% absolute) over previous work, while using less data by eschewing both expert trajectories and low-level instructions. Leveraging low-level language, however, can further increase our performance (26.49%). Our findings suggest that an explicit spatial memory and a semantic search policy can provide a stronger and more general representation for state tracking and guidance, even in the absence of expert trajectories or low-level instructions.
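The two components named in the abstract, an explicit semantic map and a search policy that picks where to look next, can be illustrated with a minimal sketch. This is not the authors' implementation: the class `SemanticMap`, the function `choose_search_goal`, and the grid layout are hypothetical names introduced here, and the nearest-unexplored-cell heuristic is only a stand-in for the learned semantic search policy, whose details the abstract does not give.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional, Set, Tuple

Cell = Tuple[int, int]  # (row, col) on a top-down grid; hypothetical layout


@dataclass
class SemanticMap:
    """Explicit spatial memory: which cells were seen and what was seen there."""
    height: int
    width: int
    explored: Set[Cell] = field(default_factory=set)
    objects: Dict[str, Set[Cell]] = field(default_factory=dict)

    def update(self, observed_cells: Iterable[Cell],
               detections: Iterable[Tuple[str, Cell]]) -> None:
        """Fold one step of perception into the map."""
        self.explored.update(observed_cells)
        for category, cell in detections:
            self.objects.setdefault(category, set()).add(cell)


def manhattan(a: Cell, b: Cell) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def choose_search_goal(sem_map: SemanticMap, agent: Cell,
                       target: str) -> Optional[Cell]:
    """Return the next cell to navigate toward, or None if nothing is left.

    If the target category is already on the map, go to its nearest instance.
    Otherwise fall back to the nearest unexplored cell (an assumed heuristic
    standing in for a learned semantic search policy).
    """
    known = sem_map.objects.get(target)
    if known:
        return min(known, key=lambda c: (manhattan(agent, c), c))
    frontier = [(r, c)
                for r in range(sem_map.height)
                for c in range(sem_map.width)
                if (r, c) not in sem_map.explored]
    if not frontier:
        return None
    # Ties are broken by cell coordinates so the choice is deterministic.
    return min(frontier, key=lambda c: (manhattan(agent, c), c))


# Usage: a 3x3 scene where a mug has been seen but an apple has not.
sem_map = SemanticMap(height=3, width=3)
sem_map.update(observed_cells=[(0, 0), (0, 1), (1, 0)],
               detections=[("Mug", (0, 1))])
goal_known = choose_search_goal(sem_map, agent=(0, 0), target="Mug")
goal_unknown = choose_search_goal(sem_map, agent=(0, 0), target="Apple")
```

Here `goal_known` is the mapped mug location, while `goal_unknown` is an unexplored cell, which shows how the map lets the agent track state without relying on a learned hidden state. A learned policy would replace the fallback branch with a prediction of where the target is likely to be.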