Efficiently finding targets in complex environments is fundamental to real-world embodied applications. While recent advances in multimodal foundation models have enabled zero-shot object-goal navigation, allowing robots to search for arbitrary objects without fine-tuning, existing methods face two key limitations: (1) heavy reliance on precise depth and pose information provided by simulators, which restricts applicability in real-world scenarios; and (2) lack of in-context learning (ICL) capability, making it difficult to adapt quickly to new environments, e.g., by leveraging short videos of them. To address these challenges, we propose RANGER, a novel zero-shot, open-vocabulary semantic navigation framework that operates using only a monocular camera. Leveraging powerful 3D foundation models, RANGER eliminates the dependency on depth and pose while exhibiting strong ICL capability: by simply observing a short video of a new environment, the system can significantly improve task efficiency without architectural modifications or fine-tuning. The framework integrates several key components: keyframe-based 3D reconstruction, semantic point cloud generation, vision-language model (VLM)-driven exploration value estimation, high-level adaptive waypoint selection, and low-level action execution. Experiments on the HM3D benchmark and in real-world environments demonstrate that RANGER achieves competitive navigation success rate and exploration efficiency while showing superior ICL adaptability, with no prior 3D map of the environment required.
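To make the pipeline stages concrete, the sketch below wires the components named in the abstract into a single loop. This is purely illustrative: every function and signature here (`reconstruct_keyframes`, `label_points`, `score_frontiers`, `select_waypoint`) is a hypothetical stand-in with toy logic, not the paper's actual API, and the real system would back each stage with a 3D foundation model and a VLM.

```python
from dataclasses import dataclass

# Illustrative sketch of the RANGER pipeline from the abstract.
# All names and logic are hypothetical stand-ins, not the paper's implementation.

@dataclass
class Waypoint:
    xyz: tuple      # 3D position of a candidate waypoint
    value: float    # estimated exploration value

def reconstruct_keyframes(frames):
    """Stand-in for monocular keyframe-based 3D reconstruction:
    emits one toy 3D point per keyframe (a real system would use a
    3D foundation model to recover geometry without depth or pose)."""
    return [(float(i), 0.0, 0.0) for i, _ in enumerate(frames)]

def label_points(points, vocabulary):
    """Stand-in for semantic point cloud generation: assigns labels
    round-robin (the real system is open-vocabulary)."""
    return [(p, vocabulary[i % len(vocabulary)]) for i, p in enumerate(points)]

def score_frontiers(labeled_points, goal):
    """Stand-in for VLM-driven exploration value estimation:
    high value where the label matches the goal, low elsewhere."""
    return [Waypoint(p, 1.0 if lbl == goal else 0.1) for p, lbl in labeled_points]

def select_waypoint(waypoints):
    """High-level adaptive waypoint selection, reduced here to a greedy max."""
    return max(waypoints, key=lambda w: w.value)

def navigate(frames, vocabulary, goal):
    points = reconstruct_keyframes(frames)       # keyframe-based 3D reconstruction
    labeled = label_points(points, vocabulary)   # semantic point cloud generation
    waypoints = score_frontiers(labeled, goal)   # exploration value estimation
    return select_waypoint(waypoints)            # low-level action execution omitted

# Toy run: three keyframes, a three-word vocabulary, goal "sofa".
wp = navigate(frames=["f0", "f1", "f2"], vocabulary=["chair", "sofa", "bed"], goal="sofa")
```

In this toy run, the point labeled "sofa" receives the highest value and is selected as the next waypoint; in-context learning would correspond to seeding `frames` with keyframes from a short walkthrough video of the environment.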