We present a simple yet effective self-supervised framework for audio-visual representation learning, with the goal of localizing sound sources in videos. To understand what enables learning useful representations, we systematically investigate the effects of data augmentation and reveal that (1) the composition of data augmentations plays a critical role, i.e. explicitly encouraging the audio-visual representations to be invariant to various transformations~({\em transformation invariance}); (2) enforcing geometric consistency substantially improves the quality of the learned representations, i.e. the detected sound source should follow the same transformation applied to the input video frames~({\em transformation equivariance}). Extensive experiments demonstrate that our model significantly outperforms previous methods on two sound localization benchmarks, namely Flickr-SoundNet and VGG-Sound. Additionally, we evaluate audio retrieval and cross-modal retrieval tasks. In both cases, our self-supervised models demonstrate superior retrieval performance, even competitive with the supervised approach on audio retrieval. This reveals that the proposed framework learns strong multi-modal representations that benefit sound localization and generalize to further applications. \textit{All code will be made available}.
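For concreteness, the transformation-equivariance constraint described above can be sketched as a simple consistency loss: the localization map predicted from geometrically transformed frames should match the same transform applied to the map predicted from the original frames. The PyTorch snippet below is a minimal illustration under this assumption; \texttt{model}, \texttt{frames}, and \texttt{audio\_feat} are hypothetical placeholders, not the paper's actual interface.

\begin{verbatim}
import torch
import torch.nn.functional as F

def equivariance_loss(model, frames, audio_feat, transform):
    """Sketch of a 'transformation equivariance' consistency term:
    the map predicted for transformed frames should equal the
    transform applied to the map predicted for the original frames."""
    # Localization map from the original frames, e.g. shape (B, 1, H', W').
    loc_map = model(frames, audio_feat)
    # Map predicted after geometrically transforming the input frames.
    loc_map_from_t = model(transform(frames), audio_feat)
    # The same geometric transform applied to the original map.
    loc_map_t = transform(loc_map)
    # Penalize disagreement between the two maps (MSE chosen here).
    return F.mse_loss(loc_map_from_t, loc_map_t)

# Example: horizontal flip as the geometric transform.
hflip = lambda x: torch.flip(x, dims=[-1])
# loss = equivariance_loss(model, frames, audio_feat, hflip)
\end{verbatim}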