The ability to accurately locate and navigate to a specific object is a crucial capability for embodied agents that operate in the real world and interact with objects to complete tasks. Such object navigation tasks usually require large-scale training in visual environments with labeled objects, and the resulting policies generalize poorly to novel objects in unknown environments. In this work, we present a novel zero-shot object navigation method, Exploration with Soft Commonsense constraints (ESC), that transfers commonsense knowledge in pre-trained models to open-world object navigation without any navigation experience or any other training on the visual environments. First, ESC leverages a pre-trained vision-and-language model for open-world prompt-based grounding and a pre-trained commonsense language model for room and object reasoning. Then ESC converts commonsense knowledge into navigation actions by modeling it as soft logic predicates for efficient exploration. Extensive experiments on the MP3D, HM3D, and RoboTHOR benchmarks show that ESC improves significantly over baselines and achieves new state-of-the-art results for zero-shot object navigation (e.g., a 158% relative Success Rate improvement over CoW on MP3D).
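To make the idea of soft commonsense constraints concrete, the following is a minimal illustrative sketch, not the ESC implementation. It assumes that exploration proceeds by choosing among candidate frontiers, and that commonsense knowledge is available as truth values in [0, 1] for object-goal and room-goal co-occurrence. The names `Frontier`, `soft_score`, `pick_frontier`, the prior tables, and the weights are all hypothetical, and the weighted-sum scoring rule is an assumption chosen for simplicity.

```python
from dataclasses import dataclass, field


@dataclass
class Frontier:
    name: str
    distance: float  # path length from the agent (hypothetical unit: metres)
    room: str        # room label grounded near this frontier
    objects: list = field(default_factory=list)  # object labels grounded nearby


def soft_score(frontier, goal, obj_prior, room_prior,
               w_obj=1.0, w_room=1.0, w_dist=0.1):
    """Score a frontier using real-valued (soft) commonsense cues.

    obj_prior[(obj, goal)] and room_prior[(room, goal)] are assumed to be
    truth values in [0, 1], e.g. elicited from a language model.
    """
    # Soft OR over nearby objects: the strongest co-occurrence cue is used.
    obj_cue = max((obj_prior.get((o, goal), 0.0) for o in frontier.objects),
                  default=0.0)
    room_cue = room_prior.get((frontier.room, goal), 0.0)
    # A weighted sum keeps the constraints soft: a weak cue lowers the score
    # but never rules a frontier out entirely.
    return w_obj * obj_cue + w_room * room_cue - w_dist * frontier.distance


def pick_frontier(frontiers, goal, obj_prior, room_prior):
    """Return the frontier with the highest soft score."""
    return max(frontiers,
               key=lambda f: soft_score(f, goal, obj_prior, room_prior))


# Toy example with made-up priors for the goal "bed".
frontiers = [
    Frontier("hall_end", distance=2.0, room="hallway", objects=["door"]),
    Frontier("bedroom_door", distance=5.0, room="bedroom", objects=["nightstand"]),
]
obj_prior = {("nightstand", "bed"): 0.9, ("door", "bed"): 0.1}
room_prior = {("bedroom", "bed"): 0.95, ("hallway", "bed"): 0.05}

print(pick_frontier(frontiers, "bed", obj_prior, room_prior).name)
# prints "bedroom_door"
```

In this toy setting the farther frontier is chosen because the commonsense cues outweigh the distance penalty. With empty prior tables, the same function falls back to picking the nearest frontier.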