Recent work on audio-visual navigation targets a single static sound in noise-free audio environments and struggles to generalize to unheard sounds. We introduce a novel dynamic audio-visual navigation benchmark in which an embodied AI agent must catch a moving sound source in an unmapped environment in the presence of distractors and noisy sounds. We propose an end-to-end reinforcement learning approach built on a multi-modal architecture that fuses spatial audio-visual information from a binaural audio signal and spatial occupancy maps to encode the features needed to learn a robust navigation policy for our new, more complex task settings. We demonstrate that our approach outperforms the current state of the art, with better generalization to unheard sounds and greater robustness to noisy scenarios, on two challenging 3D-scanned real-world datasets, Replica and Matterport3D, for both the static and dynamic audio-visual navigation benchmarks. Our novel benchmark will be made available at http://dav-nav.cs.uni-freiburg.de.