未经监督的录像-审阅学习的一致性 (Cross-Modal Attention Consistency for Video-Audio Unsupervised Learning)

Cross-modal correlation provides an inherent supervision for video unsupervised representation learning. Existing methods focus on distinguishing different video clips by visual and audio representations. We human visual perception could attend to regions where sounds are made, and our auditory perception could also ground their frequencies of sounding objects, which we call bidirectional local correspondence. Such supervision is intuitive but not well explored in the contrastive learning framework. This paper introduces a pretext task, Cross-Modal Attention Consistency (CMAC), for exploring the bidirectional local correspondence property. The CMAC approach aims to align the regional attention generated purely from the visual signal with the target attention generated under the guidance of acoustic signal, and do a similar alignment for frequency grounding on the acoustic attention. Accompanied by a remoulded cross-modal contrastive loss where we consider additional within-modal interactions, the CMAC approach works effectively for enforcing the bidirectional alignment. Extensive experiments on six downstream benchmarks demonstrate that CMAC can improve the state-of-the-art performance on both visual and audio modalities.

翻译：现有方法侧重于通过视觉和音频表达方式区分不同视频剪辑。我们人类视觉感知可以关注发出声音的区域,我们的听觉感知也可以将探测物体的频率设为禁地,我们称之为双向地方通信。这种监督是直观的,但在对比式学习框架中没有很好探讨。本文提出了探索双向当地通信财产的托辞任务,即跨模式注意一致性(CMAC )。CMAC 方法旨在将纯从视觉信号中产生的区域关注与在声音信号指导下产生的目标关注点相匹配,并对声频定位进行类似的调整。在我们考虑其他模式内互动时,CMAC 方法由重新组合的跨模式对比性损失相配合,可以有效地执行双向协调。对六个下游基准的广泛实验表明CMAC能够改善视觉和音频模式的状态表现。

相关内容

注意力机制

关注 120

Attention机制最早是在视觉图像领域提出来的，但是真正火起来应该算是google mind团队的这篇论文《Recurrent Models of Visual Attention》[14]，他们在RNN模型上使用了attention机制来进行图像分类。随后，Bahdanau等人在论文《Neural Machine Translation by Jointly Learning to Align and Translate》 [1]中，使用类似attention的机制在机器翻译任务上将翻译和对齐同时进行，他们的工作算是是第一个提出attention机制应用到NLP领域中。接着类似的基于attention机制的RNN模型扩展开始应用到各种NLP任务中。最近，如何在CNN中使用attention机制也成为了大家的研究热点。下图表示了attention研究进展的大概趋势。

【Google】监督对比学习，Supervised Contrastive Learning

专知会员服务

75+阅读 · 2020年4月24日

强化学习的对比无监督表示，CURL: Contrastive Unsupervised Representations for Reinforcement Learning

专知会员服务

41+阅读 · 2020年4月11日

【ACL2020-Facebook AI】跨语言表示学习，Unsupervised Cross-lingual Representation Learning at Scale

专知会员服务

27+阅读 · 2020年4月5日

【香港中文大学-CVPR2020】Rotate-and-Render: Unsupervised Photorealistic Face Rotation from Single-View Images

专知会员服务

22+阅读 · 2020年3月18日