Visual target navigation in unknown environments is a crucial problem in robotics. Despite extensive investigation of classical and learning-based approaches in the past, robots still lack common-sense knowledge about household objects and layouts. Prior state-of-the-art approaches to this task learn such priors during training and typically require substantial computational resources and time. To address this, we propose a new framework for visual target navigation that leverages large language models (LLMs) to impart common sense for object search. Specifically, we introduce two paradigms, (i) a zero-shot approach and (ii) a feed-forward approach, that use language to identify the relevant frontier in the semantic map as a long-term goal and thereby explore the environment efficiently. Our analysis demonstrates notable zero-shot generalization and transfer capabilities arising from the use of language. Experiments on Gibson and Habitat-Matterport 3D (HM3D) show that the proposed framework significantly outperforms existing map-based methods in terms of success rate and generalization. Ablation analysis further indicates that the common-sense knowledge from the language model leads to more efficient semantic exploration. Finally, we provide a real-robot experiment to verify the applicability of our framework in real-world scenarios. The supplementary video and code can be accessed via the following link: https://sites.google.com/view/l3mvn.