We present a simple yet effective self-supervised framework for audio-visual representation learning, with the goal of localizing the sound source in videos. To understand what enables the model to learn useful representations, we systematically investigate the effects of data augmentation and reveal that (1) the composition of data augmentations plays a critical role, {\em i.e.}~explicitly encouraging the audio-visual representations to be invariant to various transformations~({\em transformation invariance}); and (2) enforcing geometric consistency substantially improves the quality of the learned representations, {\em i.e.}~the detected sound source should follow the same transformation applied to the input video frames~({\em transformation equivariance}). Extensive experiments demonstrate that our model significantly outperforms previous methods on two sound localization benchmarks, namely Flickr-SoundNet and VGG-Sound. In addition, we evaluate on audio retrieval and cross-modal retrieval tasks; in both cases, our self-supervised models achieve superior retrieval performance, even competitive with the supervised approach on audio retrieval. This shows that the proposed framework learns strong multi-modal representations that benefit sound localization and generalize to further applications. \textit{All code will be made available.}
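To make the two constraints concrete, we include a minimal sketch below (for illustration only, not our released implementation); it assumes a hypothetical \texttt{model(frames, audio)} that returns a localization map of shape $(B, 1, H, W)$ and uses a horizontal flip as the geometric transformation.
\begin{verbatim}
# Minimal sketch of the two constraints (illustration only, not the released
# implementation). It assumes a hypothetical model(frames, audio) returning an
# audio-visual localization map of shape (B, 1, H, W), and uses a horizontal
# flip as the geometric transformation.
import torch
import torch.nn.functional as F

def transformation_equivariance_loss(model, frames, audio):
    """The detected sound source should undergo the same geometric
    transformation that is applied to the input video frames."""
    loc_map = model(frames, audio)                    # map from original frames
    frames_flipped = torch.flip(frames, dims=[-1])    # transform the input frames
    map_from_flipped = model(frames_flipped, audio)   # map from transformed frames
    map_flipped = torch.flip(loc_map, dims=[-1])      # same transform on original map
    return F.mse_loss(map_from_flipped, map_flipped)

def transformation_invariance_loss(model, frames, frames_aug, audio):
    """Appearance augmentations (e.g. color jitter) should leave the
    audio-visual localization map unchanged."""
    return F.mse_loss(model(frames_aug, audio), model(frames, audio))

if __name__ == "__main__":
    # Stand-in model for a quick smoke test; any callable with this signature works.
    dummy = lambda frames, audio: frames.mean(dim=1, keepdim=True)
    frames = torch.rand(2, 3, 224, 224)
    audio = torch.rand(2, 1, 257, 300)
    print(transformation_equivariance_loss(dummy, frames, audio))
\end{verbatim}
Any invertible geometric transformation (e.g.~random crop-and-resize or rotation) could replace the flip in this sketch; the essential point is that the same transformation is applied consistently to the input frames and to the predicted localization map.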