Contrastive learning has shown great potential in video representation learning. However, existing approaches fail to sufficiently exploit short-term motion dynamics, which are crucial to various downstream video understanding tasks. In this paper, we propose Motion Sensitive Contrastive Learning (MSCL), which injects the motion information captured by optical flows into RGB frames to strengthen feature learning. To achieve this, in addition to clip-level global contrastive learning, we develop Local Motion Contrastive Learning (LMCL) with frame-level contrastive objectives across the two modalities. Moreover, we introduce Flow Rotation Augmentation (FRA) to generate extra motion-shuffled negative samples and Motion Differential Sampling (MDS) to accurately screen training samples. Extensive experiments on standard benchmarks validate the effectiveness of the proposed method. With the commonly used 3D ResNet-18 as the backbone, we achieve top-1 accuracies of 91.5\% on UCF101 and 50.3\% on Something-Something v2 for video classification, and a top-1 recall of 65.6\% on UCF101 for video retrieval, notably improving over the state of the art.
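To make the frame-level cross-modal objective in LMCL concrete, the sketch below shows a minimal InfoNCE-style loss between temporally aligned per-frame RGB and optical-flow embeddings, where matching frame indices are positives and all other pairs within the clip serve as negatives. This is an illustrative assumption of how such an objective could look; the function name \texttt{frame\_level\_nce}, the symmetric two-direction formulation, and the temperature value are hypothetical and may differ from the paper's exact formulation.

\begin{verbatim}
import torch
import torch.nn.functional as F

def frame_level_nce(rgb_feats, flow_feats, temperature=0.07):
    """Hypothetical frame-level cross-modal InfoNCE sketch.

    rgb_feats, flow_feats: (T, D) per-frame embeddings from the RGB and
    optical-flow branches of the same clip. Temporally aligned frames are
    treated as positives; all other frame pairs act as negatives.
    """
    rgb = F.normalize(rgb_feats, dim=-1)
    flow = F.normalize(flow_feats, dim=-1)
    logits = rgb @ flow.t() / temperature      # (T, T) similarity matrix
    targets = torch.arange(rgb.size(0))        # positive = same frame index
    # Symmetric loss over both matching directions (RGB->flow, flow->RGB).
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example: a clip with 8 frames and 128-d per-frame embeddings.
loss = frame_level_nce(torch.randn(8, 128), torch.randn(8, 128))
\end{verbatim}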