In this paper, we propose a novel learning scheme for self-supervised video representation learning. Motivated by how humans understand videos, we propose to first learn general visual concepts and then attend to discriminative local areas for video understanding. Specifically, we utilize static frames and frame differences to help decouple static and dynamic concepts, and align the concept distributions of each in latent space. We add diversity and fidelity regularizations to guarantee that we learn a compact set of meaningful concepts. We then employ a cross-attention mechanism to aggregate detailed local features under different concepts, and filter out redundant concepts with low activations to perform local concept contrast. Extensive experiments demonstrate that our method distills meaningful static and dynamic concepts to guide video understanding, and obtains state-of-the-art results on UCF-101, HMDB-51, and Diving-48.