Audio-visual embodied navigation, a popular research topic, aims to train a robot to reach an audio target using egocentric visual input (from sensors mounted on the robot) and audio input (emitted by the target). The audio-visual fusion strategy is naturally critical to navigation performance, yet state-of-the-art methods still simply concatenate the visual and audio features, potentially ignoring the direct impact of context. Moreover, existing approaches require either phase-wise training or additional aids (e.g., a topology graph and sound semantics). To date, work addressing the more challenging setup with moving target(s) remains rare. We therefore propose FSAAVN (feature self-attention audio-visual navigation), an end-to-end framework that learns to chase a moving audio target using a context-aware audio-visual fusion strategy implemented as a self-attention module. Thorough experiments validate the superior performance (both quantitative and qualitative) of FSAAVN over the state of the art, and also provide unique insights into the choice of visual modalities, visual/audio encoder backbones, and fusion patterns.
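To illustrate the difference between plain feature concatenation and the context-aware self-attention fusion the abstract describes, here is a minimal PyTorch sketch. The class name, feature dimension, single-head attention, and projection layer are all illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of context-aware audio-visual fusion via self-attention,
# contrasted with the plain concatenation used by prior methods.
# All names and dimensions are assumptions for illustration only.
import torch
import torch.nn as nn


class SelfAttentionFusion(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        # Treat the visual and audio embeddings as a 2-token sequence and let
        # self-attention reweight each modality in the context of the other.
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=1, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, visual_feat: torch.Tensor, audio_feat: torch.Tensor) -> torch.Tensor:
        # visual_feat, audio_feat: (batch, dim)
        tokens = torch.stack([visual_feat, audio_feat], dim=1)  # (batch, 2, dim)
        fused, _ = self.attn(tokens, tokens, tokens)            # attention across modalities
        return self.proj(fused.flatten(1))                      # (batch, dim) fused state


# Baseline: simple concatenation, which ignores cross-modal context.
def concat_fusion(visual_feat: torch.Tensor, audio_feat: torch.Tensor) -> torch.Tensor:
    return torch.cat([visual_feat, audio_feat], dim=-1)


if __name__ == "__main__":
    v, a = torch.randn(4, 512), torch.randn(4, 512)
    print(SelfAttentionFusion()(v, a).shape)  # torch.Size([4, 512])
    print(concat_fusion(v, a).shape)          # torch.Size([4, 1024])
```

The fused feature would then feed the navigation policy in place of the concatenated one; the sketch only shows the fusion step itself.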