While interacting in the world is a multi-sensory experience, many robots continue to predominantly rely on visual perception to map and navigate in their environments. In this work, we propose Audio-Visual-Language Maps (AVLMaps), a unified 3D spatial map representation for storing cross-modal information from audio, visual, and language cues. AVLMaps integrate the open-vocabulary capabilities of multimodal foundation models pre-trained on Internet-scale data by fusing their features into a centralized 3D voxel grid. In the context of navigation, we show that AVLMaps enable robot systems to index goals in the map based on multimodal queries, e.g., textual descriptions, images, or audio snippets of landmarks. In particular, the addition of audio information enables robots to more reliably disambiguate goal locations. Extensive experiments in simulation show that AVLMaps enable zero-shot multimodal goal navigation from multimodal prompts and provide 50% better recall in ambiguous scenarios. These capabilities extend to mobile robots in the real world - navigating to landmarks referring to visual, audio, and spatial concepts. Videos and code are available at: https://avlmaps.github.io.
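To make the core idea concrete, below is a minimal, purely illustrative sketch (not the authors' implementation) of a 3D voxel map that fuses per-modality feature vectors and indexes goals by similarity to an embedded query. All names here (`VoxelFeatureMap`, `add_observation`, `localize`) are hypothetical; in AVLMaps the features would come from pre-trained visual-language and audio foundation models.

```python
import numpy as np


class VoxelFeatureMap:
    """Illustrative 3D voxel map storing per-modality feature vectors.

    Hypothetical sketch of the AVLMaps idea: each voxel keeps a
    running-average embedding per modality (e.g. 'visual_language',
    'audio'), which can later be matched against an embedded query.
    """

    def __init__(self, voxel_size=0.05):
        self.voxel_size = voxel_size
        # voxel index (i, j, k) -> {modality: (feature_sum, count)}
        self.voxels = {}

    def _key(self, point_xyz):
        return tuple(np.floor(np.asarray(point_xyz) / self.voxel_size).astype(int))

    def add_observation(self, point_xyz, feature, modality):
        """Fuse one back-projected feature into the voxel containing point_xyz."""
        key = self._key(point_xyz)
        slots = self.voxels.setdefault(key, {})
        feat_sum, count = slots.get(modality, (np.zeros_like(feature), 0))
        slots[modality] = (feat_sum + feature, count + 1)

    def localize(self, query_feature, modality, top_k=5):
        """Return the top-k voxel indices whose fused features best match the query."""
        q = query_feature / (np.linalg.norm(query_feature) + 1e-8)
        scored = []
        for key, slots in self.voxels.items():
            if modality not in slots:
                continue
            feat_sum, count = slots[modality]
            f = feat_sum / count
            f = f / (np.linalg.norm(f) + 1e-8)
            scored.append((float(np.dot(q, f)), key))
        scored.sort(reverse=True)
        return [key for _, key in scored[:top_k]]
```

In a full system, `query_feature` would be produced by the same encoders used at mapping time, so that a textual description, an image, or an audio snippet is embedded into the shared feature space before calling `localize` to retrieve candidate goal locations.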