The recent success of audio-visual representation learning can be largely attributed to the pervasive property of audio-visual synchronization in videos, which can be used as self-annotated supervision. As a state-of-the-art solution, Audio-Visual Instance Discrimination (AVID) extends instance discrimination to the audio-visual realm. Existing AVID methods construct the contrastive set by random sampling, based on the assumption that the audio and visual clips from all other videos are not semantically related. We argue that this assumption is too coarse, since the resulting contrastive sets contain a large number of false negatives. In this paper, we overcome this limitation by proposing a novel Active Contrastive Set Mining (ACSM) approach that aims to mine contrastive sets with informative and diverse negatives for robust AVID. Moreover, we integrate a semantically-aware hard-sample mining strategy into our ACSM. We implement the proposed ACSM in two recent state-of-the-art AVID methods and significantly improve their performance. Extensive experiments on both action and sound recognition across multiple datasets demonstrate the markedly improved performance of our method.
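To make the random-sampling assumption concrete, the following is a minimal NumPy sketch (not the authors' implementation) of an InfoNCE-style audio-visual contrastive loss in which every other clip in the batch is treated as a negative. Rows that happen to be semantically related to the anchor still enter the denominator, which is exactly the false-negative problem the paper addresses; all names here are illustrative.

```python
import numpy as np

def avid_nce_loss(v, a, temperature=0.07):
    """InfoNCE-style audio-visual contrastive loss over a batch.

    v, a: (N, D) L2-normalized visual / audio embeddings; row i of each
    comes from the same video (the positive pair). Every other row in the
    batch serves as a negative -- the random-sampling assumption the
    abstract critiques, since some of those rows may be semantically
    related to the anchor (false negatives).
    """
    logits = v @ a.T / temperature               # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # positives on the diagonal

# Toy usage: random unit embeddings for a batch of 8 videos.
rng = np.random.default_rng(0)
v = rng.normal(size=(8, 16)); v /= np.linalg.norm(v, axis=1, keepdims=True)
a = rng.normal(size=(8, 16)); a /= np.linalg.norm(a, axis=1, keepdims=True)
loss = avid_nce_loss(v, a)
```

ACSM would replace the implicit "all off-diagonal entries are negatives" choice above with an actively mined set of informative, diverse negatives.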