The objective of this work is to localize sound sources in visual scenes. Existing audio-visual works employ contrastive learning by assigning corresponding audio-visual pairs from the same source as positives and randomly mismatched pairs as negatives. However, these negative pairs may contain semantically matched audio-visual information. Thus, these semantically correlated pairs, "hard positives", are mistakenly grouped as negatives. Our key contribution is showing that hard positives can give response maps similar to those of the corresponding pairs. Our approach incorporates these hard positives by adding their response maps directly into the contrastive learning objective. We demonstrate the effectiveness of our approach on the VGG-SS and SoundNet-Flickr test sets, showing favorable performance compared to state-of-the-art methods.
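For concreteness, a minimal sketch of how such an objective could look (notation is ours for illustration, not necessarily the paper's exact formulation): let $S_{ij}$ denote the score obtained by pooling the audio-visual response map between audio $a_i$ and image $v_j$, and let $\tau$ be a temperature. A standard contrastive objective treats only the corresponding pair $(i,i)$ as positive,
\[
\mathcal{L}_i = -\log \frac{\exp(S_{ii}/\tau)}{\sum_{j} \exp(S_{ij}/\tau)},
\]
whereas incorporating the response maps of semantically similar hard positives $\mathcal{P}_i$ amounts to moving them into the positive set,
\[
\mathcal{L}_i = -\log \frac{\sum_{k \in \{i\} \cup \mathcal{P}_i} \exp(S_{ik}/\tau)}{\sum_{j} \exp(S_{ij}/\tau)},
\]
so that semantically matched but mismatched-by-sampling pairs are no longer pushed apart.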