The objective of this work is to localize sound sources that are visible in a video without using manual annotations. Our key technical contribution is to show that, by training the network to explicitly discriminate challenging image fragments, even for images that do contain the object emitting the sound, we can significantly boost the localization performance. We do so elegantly by introducing a mechanism to mine hard samples and add them to a contrastive learning formulation automatically. We show that our algorithm achieves state-of-the-art performance on the popular Flickr SoundNet dataset. Furthermore, we introduce the VGG-Sound Source (VGG-SS) benchmark, a new set of annotations for the recently-introduced VGG-Sound dataset, where the sound sources visible in each video clip are explicitly marked with bounding box annotations. This dataset is 20 times larger than analogous existing ones, contains 5K videos spanning over 200 categories, and, differently from Flickr SoundNet, is video-based. On VGG-SS, we also show that our algorithm achieves state-of-the-art performance against several baselines.
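To make the hard-sample mining idea concrete, below is a minimal, hedged PyTorch sketch of a contrastive objective in which low-similarity regions of the matching image are mined as within-image hard negatives, while other images in the batch supply easy negatives. This is an illustrative approximation, not the paper's exact formulation: the function name `hard_way_contrastive_loss`, the thresholds `eps_pos`/`eps_neg`, the soft-mask construction, and the tensor shapes are all assumptions introduced here for clarity.

```python
import torch
import torch.nn.functional as F

def hard_way_contrastive_loss(visual_feat, audio_emb,
                              eps_pos=0.65, eps_neg=0.4, temperature=0.07):
    """Sketch of a contrastive loss with automatically mined hard negatives.

    visual_feat: (B, C, H, W) spatial visual features from the image branch.
    audio_emb:   (B, C)       audio embeddings from the sound branch.
    Regions of the matching image whose audio-visual similarity exceeds
    eps_pos act as positives; regions below eps_neg are mined as hard
    negatives. All names and thresholds are illustrative assumptions.
    """
    B, C, H, W = visual_feat.shape
    v = F.normalize(visual_feat, dim=1)          # (B, C, H, W)
    a = F.normalize(audio_emb, dim=1)            # (B, C)

    # Cosine similarity of every audio clip with every image's spatial grid.
    sim = torch.einsum('ac,vchw->avhw', a, v)    # (B_audio, B_visual, H, W)

    # Diagonal entries: each audio clip against its own video frame.
    pos_sim = sim[torch.arange(B), torch.arange(B)]          # (B, H, W)

    # Soft masks: likely sound-source regions vs. within-image hard negatives.
    pos_mask = torch.sigmoid((pos_sim - eps_pos) / temperature)
    hard_neg_mask = torch.sigmoid((eps_neg - pos_sim) / temperature)

    # Pool similarities inside each mask.
    pos_score = (pos_mask * pos_sim).sum((1, 2)) / (pos_mask.sum((1, 2)) + 1e-6)
    hard_neg_score = (hard_neg_mask * pos_sim).sum((1, 2)) / (hard_neg_mask.sum((1, 2)) + 1e-6)

    # Easy negatives: mean similarity with the other images in the batch.
    off_diag = ~torch.eye(B, dtype=torch.bool, device=sim.device)
    easy_neg_score = sim.mean((2, 3))[off_diag].view(B, B - 1).mean(1)

    # Contrastive objective: positive regions vs. hard and easy negatives.
    logits = torch.stack([pos_score, hard_neg_score, easy_neg_score], dim=1) / temperature
    target = torch.zeros(B, dtype=torch.long, device=sim.device)
    return F.cross_entropy(logits, target)

# Example usage with random features (batch of 8, 512-dim, 14x14 grid).
loss = hard_way_contrastive_loss(torch.randn(8, 512, 14, 14), torch.randn(8, 512))
```

The design point illustrated here is the one stated in the abstract: the loss explicitly discriminates challenging image fragments inside the positive image itself, rather than relying only on other images as negatives.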