In audio-visual navigation (AVN), an intelligent agent must navigate to a constantly sounding object in complex 3D environments based on its audio and visual perceptions. While existing methods attempt to improve navigation performance with carefully designed path planning or intricate task settings, none improves the model's generalisation to unheard sounds while leaving the task settings unchanged. We therefore propose a contrastive learning-based method that tackles this challenge by regularising the audio encoder, so that sound-agnostic, goal-driven latent representations can be learnt from audio signals of various classes. In addition, we consider two data augmentation strategies to enrich the training sounds. We demonstrate that our designs can easily be equipped to existing AVN frameworks to obtain an immediate performance gain (13.4%$\uparrow$ in SPL on Replica and 12.2%$\uparrow$ in SPL on MP3D). Our project is available at https://AV-GeN.github.io/.
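To make the idea of regularising the audio encoder concrete, the sketch below shows a generic supervised contrastive (InfoNCE-style) loss in which audio embeddings sharing a label (e.g. emitted from the same goal) are pulled together and all others are pushed apart. This is a minimal illustrative sketch, not the authors' exact formulation; the function name, label semantics, and temperature value are assumptions.

```python
import numpy as np

def contrastive_loss(embeddings, labels, temperature=0.1):
    """Hypothetical supervised contrastive loss over audio embeddings.

    Embeddings with the same label act as positive pairs; all other
    embeddings in the batch act as negatives (InfoNCE-style).
    """
    # L2-normalise so similarity is cosine similarity.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature

    n = len(labels)
    loss, count = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # Denominator sums over every other sample in the batch.
        denom = sum(np.exp(sim[i, j]) for j in range(n) if j != i)
        for j in positives:
            loss += -np.log(np.exp(sim[i, j]) / denom)
            count += 1
    return loss / count
```

Used as an auxiliary term alongside the navigation objective, such a loss encourages the encoder to map different sounds from the same goal class to nearby latent codes, which is one plausible way to obtain sound-agnostic, goal-driven representations.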