Cross-modal correlation provides inherent supervision for unsupervised video representation learning. Existing methods focus on distinguishing different video clips through visual and audio representations. Human visual perception can attend to the regions where sounds are made, and human auditory perception can likewise ground the frequencies of sounding objects; we call this property bidirectional local correspondence. Such supervision is intuitive but not well explored in the contrastive learning framework. This paper introduces a pretext task, Cross-Modal Attention Consistency (CMAC), to exploit the bidirectional local correspondence property. CMAC aims to align the regional attention generated purely from the visual signal with the target attention generated under the guidance of the acoustic signal, and performs a similar alignment for frequency grounding on the acoustic attention. Accompanied by a remoulded cross-modal contrastive loss that additionally considers within-modal interactions, CMAC effectively enforces this bidirectional alignment. Extensive experiments on six downstream benchmarks demonstrate that CMAC improves state-of-the-art performance on both the visual and audio modalities.
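To make the two objectives concrete, the sketch below illustrates one plausible instantiation of the bidirectional attention alignment and of a contrastive loss with added within-modal terms. It is a minimal illustration, not the paper's implementation: the function names, the choice of KL divergence as the alignment measure, and the InfoNCE-style formulation with within-modal negatives are all assumptions for exposition.

```python
import torch
import torch.nn.functional as F


def attention_consistency_loss(self_attn, guided_attn):
    """Align a self-generated attention map with a cross-modally
    guided target attention map (target detached so gradients flow
    only into the self-generated branch). KL divergence is one
    plausible choice of alignment measure; it is an assumption here."""
    # Flatten spatial (or frequency) dimensions and normalize to
    # probability distributions over attention locations.
    p = F.log_softmax(self_attn.flatten(1), dim=1)
    q = F.softmax(guided_attn.flatten(1).detach(), dim=1)
    return F.kl_div(p, q, reduction="batchmean")


def cmac_alignment(vis_attn, audio_guided_vis_attn,
                   aud_attn, vision_guided_aud_attn):
    """Bidirectional alignment: regional visual attention matched to
    the audio-guided target, and frequency-wise acoustic attention
    matched to the vision-guided target."""
    l_v = attention_consistency_loss(vis_attn, audio_guided_vis_attn)
    l_a = attention_consistency_loss(aud_attn, vision_guided_aud_attn)
    return l_v + l_a


def contrastive_loss_with_within_modal(v, a, tau=0.07):
    """A hypothetical cross-modal InfoNCE loss whose negative set is
    augmented with within-modal pairs, gesturing at the 'additional
    within-modal interactions' mentioned in the abstract."""
    v = F.normalize(v, dim=1)  # video clip embeddings [B, D]
    a = F.normalize(a, dim=1)  # audio clip embeddings [B, D]
    B = v.size(0)
    logits_va = v @ a.t() / tau              # cross-modal similarities
    logits_vv = v @ v.t() / tau              # within-modal similarities
    logits_vv.fill_diagonal_(float("-inf"))  # exclude the self-pair
    logits = torch.cat([logits_va, logits_vv], dim=1)
    labels = torch.arange(B, device=v.device)  # positive: paired audio
    return F.cross_entropy(logits, labels)
```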