The underlying correlation between the audio and visual modalities within videos can be exploited as a source of supervisory signal for unlabeled videos. In this paper, we present an end-to-end self-supervised framework named Audio-Visual Contrastive Learning (AVCL) that learns discriminative audio-visual representations for action recognition. Specifically, we design an attention-based multi-modal fusion module (AMFM) to fuse the audio and visual modalities. To align the heterogeneous audio and visual modalities, we construct a novel co-correlation guided representation alignment module (CGRA). To learn supervisory information from unlabeled videos, we propose a novel self-supervised contrastive learning module (SelfCL). Furthermore, to expand the existing audio-visual action recognition datasets and better evaluate our framework, we build a new audio-visual action recognition dataset named Kinetics-Sounds100. Experimental results on the Kinetics-Sounds32 and Kinetics-Sounds100 datasets demonstrate that AVCL outperforms state-of-the-art methods on these large-scale action recognition benchmarks.
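To make the self-supervised objective concrete, the sketch below shows one common way to contrast paired audio and visual embeddings with a symmetric InfoNCE-style loss. This is a minimal illustration under our own assumptions (the function name, the temperature value, and the use of in-batch negatives are illustrative), not the exact formulation of the SelfCL module.

```python
import torch
import torch.nn.functional as F

def audio_visual_contrastive_loss(audio_emb, visual_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired audio/visual embeddings.

    audio_emb, visual_emb: tensors of shape (batch, dim) extracted from the same
    clips, so the i-th audio and i-th visual embeddings form a positive pair and
    all other pairings in the batch serve as negatives.
    """
    audio_emb = F.normalize(audio_emb, dim=1)
    visual_emb = F.normalize(visual_emb, dim=1)

    # (batch, batch) cosine-similarity matrix scaled by the temperature
    logits = audio_emb @ visual_emb.t() / temperature
    targets = torch.arange(audio_emb.size(0), device=logits.device)

    loss_a2v = F.cross_entropy(logits, targets)      # audio -> visual direction
    loss_v2a = F.cross_entropy(logits.t(), targets)  # visual -> audio direction
    return 0.5 * (loss_a2v + loss_v2a)
```

In this kind of objective, corresponding audio and visual clips are pulled together in the embedding space while mismatched pairs from the same batch are pushed apart, which is how correlation across the two modalities can supply supervision without labels.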