The ability to accurately locate and navigate to a specific object is a crucial capability for embodied agents that operate in the real world and interact with objects to complete tasks. Such object navigation tasks usually require large-scale training in visual environments with labeled objects, and the resulting agents generalize poorly to novel objects in unknown environments. In this work, we present a novel zero-shot object navigation method, Exploration with Soft Commonsense constraints (ESC), that transfers commonsense knowledge in pre-trained models to open-world object navigation without any navigation experience or any other training on the visual environments. First, ESC leverages a pre-trained vision-and-language model for open-world prompt-based grounding and a pre-trained commonsense language model for room and object reasoning. Then ESC converts commonsense knowledge into navigation actions by modeling it as soft logic predicates for efficient exploration. Extensive experiments on the MP3D, HM3D, and RoboTHOR benchmarks show that ESC improves significantly over baselines and achieves new state-of-the-art results for zero-shot object navigation (e.g., a 225\% relative Success Rate improvement over CoW on MP3D).
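To make the idea of soft logic predicates concrete, the following is a minimal illustrative sketch, not the ESC implementation. It assumes a Łukasiewicz-style soft conjunction, a hand-written co-occurrence table standing in for the commonsense language model, and a simple distance penalty; all names (`Frontier`, `soft_and`, `frontier_score`, `choose_frontier`) and all numbers are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Frontier:
    """A candidate exploration target (hypothetical structure)."""
    name: str
    distance: float  # metres from the agent to this frontier
    # Detected object label -> detection confidence in [0, 1]
    nearby_objects: dict = field(default_factory=dict)


def soft_and(a: float, b: float) -> float:
    """Lukasiewicz soft conjunction: truth values in [0, 1] instead of booleans."""
    return max(0.0, a + b - 1.0)


def frontier_score(frontier: Frontier, cooccurrence: dict,
                   distance_weight: float = 0.1) -> float:
    """Score the soft rule: Near(frontier, obj) AND CoOccurs(obj, goal).

    The best-supported nearby object determines the commonsense support;
    a distance penalty (an assumption here) discourages far-away frontiers.
    """
    support = max(
        (soft_and(conf, cooccurrence.get(obj, 0.0))
         for obj, conf in frontier.nearby_objects.items()),
        default=0.0,
    )
    return support - distance_weight * frontier.distance


def choose_frontier(frontiers: list, cooccurrence: dict) -> Frontier:
    """Pick the frontier with the highest soft-constraint score."""
    return max(frontiers, key=lambda f: frontier_score(f, cooccurrence))


# Hypothetical commonsense scores for the goal "tv"; in practice these
# would come from a pre-trained language model, not a fixed table.
cooccurrence_with_tv = {"sofa": 0.9, "sink": 0.1}

frontiers = [
    Frontier("A", distance=2.0, nearby_objects={"sink": 0.8}),
    Frontier("B", distance=4.0, nearby_objects={"sofa": 0.9}),
]

best = choose_frontier(frontiers, cooccurrence_with_tv)
print(best.name)
```

In this toy setup the farther frontier next to the sofa is preferred over the closer one next to the sink, because the constraints are soft: weak commonsense evidence lowers a frontier's score rather than ruling it out.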