Images are a convenient way to specify which particular object instance an embodied agent should navigate to. Solving this task requires semantic visual reasoning and exploration of unknown environments. We present a system that can perform this task in both simulation and the real world. Our modular method solves sub-tasks of exploration, goal instance re-identification, goal localization, and local navigation. We re-identify the goal instance in egocentric vision using feature matching and localize the goal instance by projecting matched features to a map. Each sub-task is solved using off-the-shelf components requiring zero fine-tuning. On the HM3D InstanceImageNav benchmark, this system outperforms a baseline end-to-end RL policy by 7x and a state-of-the-art ImageNav model by 2.3x (56% vs. 25% success). We deploy this system on a mobile robot platform and demonstrate effective real-world performance, achieving an 88% success rate across a home and an office environment.
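To make the "re-identify, then localize" step concrete, below is a minimal sketch of that idea, not the paper's exact pipeline: it swaps in classical SIFT matching with Lowe's ratio test (OpenCV) as a stand-in for the off-the-shelf matcher, and it assumes a depth image, camera intrinsics (fx, fy, cx, cy), and a camera-to-world pose are available. Function names such as `reidentify_goal` and `localize_goal` are illustrative, not from the paper.

```python
# Hedged sketch: goal re-identification via feature matching, then projection
# of matched pixels into a world/map frame. Assumes SIFT as the matcher and
# that depth, intrinsics, and pose are provided; all names are hypothetical.
import cv2
import numpy as np

def reidentify_goal(goal_img, ego_img, min_matches=20):
    """Return matched egocentric keypoints if the goal instance is re-identified."""
    sift = cv2.SIFT_create()
    kp_g, des_g = sift.detectAndCompute(goal_img, None)
    kp_e, des_e = sift.detectAndCompute(ego_img, None)
    if des_g is None or des_e is None:
        return None
    matches = cv2.BFMatcher().knnMatch(des_g, des_e, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]  # Lowe's ratio test
    if len(good) < min_matches:
        return None  # not confident this frame shows the goal instance
    return np.array([kp_e[m.trainIdx].pt for m in good])  # (u, v) pixels in ego view

def localize_goal(pixels, depth, K, T_world_cam):
    """Back-project matched pixels with depth + intrinsics, then transform to the map frame."""
    fx, fy, cx, cy = K
    u, v = pixels[:, 0], pixels[:, 1]
    z = depth[v.astype(int), u.astype(int)]  # metric depth at each matched pixel
    pts_cam = np.stack([(u - cx) * z / fx,
                        (v - cy) * z / fy,
                        z,
                        np.ones_like(z)], axis=1)        # homogeneous camera-frame points
    pts_world = (T_world_cam @ pts_cam.T).T[:, :3]       # 4x4 camera-to-world transform
    return pts_world.mean(axis=0)                        # crude goal estimate: mean of matches
```

In this sketch, a frame is declared a re-identification only when enough ratio-test matches survive, and the goal location is taken as the mean of the back-projected matches; the actual system composes its own off-the-shelf components for these steps.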