Self-supervised audio-visual source localization aims to locate sound-source objects in video frames without extra annotations. Recent methods often approach this goal with contrastive learning, which assumes that only the audio and visual contents from the same video are positive samples for each other. However, this assumption suffers from false negative samples in real-world training. For example, for an audio sample, treating frames from other videos of the same audio class as negative samples may mislead the model and therefore harm the learned representations (e.g., the audio of a siren wailing may reasonably correspond to the ambulances in multiple images). Based on this observation, we propose a new learning strategy named False Negative Aware Contrastive (FNAC) to mitigate the problem of such false negative samples misleading the training. Specifically, we utilize the intra-modal similarities to identify potentially similar samples and construct corresponding adjacency matrices to guide contrastive learning. Further, we propose to strengthen the role of true negative samples by explicitly leveraging the visual features of sound sources to facilitate the differentiation of authentic sounding source regions. FNAC achieves state-of-the-art performance on Flickr-SoundNet, VGG-Sound, and AVSBench, which demonstrates the effectiveness of our method in mitigating the false negative issue. The code is available at \url{https://github.com/weixuansun/FNAC-AVL}.
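The idea of using intra-modal similarities to build adjacency matrices that guide contrastive learning can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the repository linked above for that): the function names, the temperature `tau`, the mixing weight `alpha`, and the choice to blend the adjacency into the contrastive targets are all assumptions made for illustration only.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def softmax(x, axis=1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def intra_modal_adjacency(feats, tau=0.1):
    """Hypothetical helper: row-stochastic adjacency from within-modality
    cosine similarities. Samples that resemble each other (likely false
    negatives) receive a larger weight."""
    f = l2_normalize(feats)
    return softmax(f @ f.T / tau, axis=1)


def fn_aware_contrastive_loss(audio, visual, tau=0.1, alpha=0.5):
    """Illustrative false-negative-aware loss (an assumption, not FNAC itself).

    Standard contrastive learning uses the identity matrix as the target,
    so every other sample in the batch is a negative. Here the target is
    blended with the intra-modal adjacency, so that samples similar to the
    anchor are no longer treated as pure negatives.
    """
    a, v = l2_normalize(audio), l2_normalize(visual)
    n = a.shape[0]

    # Cross-modal log-probabilities (audio -> visual direction).
    logits = a @ v.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Average the audio and visual adjacency matrices as soft guidance.
    adj = 0.5 * (intra_modal_adjacency(audio, tau) + intra_modal_adjacency(visual, tau))
    targets = (1.0 - alpha) * np.eye(n) + alpha * adj

    # Cross-entropy between the soft targets and the cross-modal prediction.
    return float(-(targets * log_prob).sum(axis=1).mean())
```

With `alpha=0.0` the sketch reduces to the usual one-directional InfoNCE loss, which makes the effect of the adjacency guidance easy to isolate. The true-negative enhancement described in the abstract is not covered here.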