Learning effective motion features is an essential pursuit of video representation learning. This paper presents a simple yet effective sample construction strategy to boost the learning of motion features in video contrastive learning. The proposed method, dubbed Motion-focused Quadruple Construction (MoQuad), augments instance discrimination by meticulously disturbing the appearance and motion of both the positive and negative samples to create a quadruple for each video instance, such that the model is encouraged to exploit motion information. Unlike recent approaches that create extra auxiliary tasks for learning motion features or apply explicit temporal modelling, our method keeps the simple and clean contrastive learning paradigm (i.e., SimCLR) without multi-task learning or extra modelling. In addition, we design two extra training strategies based on an analysis of initial MoQuad experiments. Extensive experiments show that simply applying MoQuad to SimCLR achieves superior performance on downstream tasks compared to the state of the art. Notably, on the UCF-101 action recognition task, we achieve 93.7% accuracy after pre-training the model on Kinetics-400 for only 200 epochs, surpassing various previous methods.
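The abstract does not spell out how a quadruple is built, so the following is only a minimal sketch of the general idea, not the MoQuad implementation. It assumes a clip is a float array of shape (T, H, W, C) in [0, 1]. The specific disturbances are hypothetical stand-ins: reversed playback for a motion disturbance, and a per-clip colour offset for an appearance disturbance. The names `disturb_motion`, `disturb_appearance`, `build_quadruple`, and `info_nce` are illustrative and do not come from the paper.

```python
import numpy as np

def disturb_motion(clip):
    """Hypothetical motion disturbance: reverse the frame order of a (T, H, W, C) clip."""
    return clip[::-1].copy()

def disturb_appearance(clip, rng, strength=0.2):
    """Hypothetical appearance disturbance: one random per-channel offset shared by all frames."""
    shift = rng.uniform(-strength, strength, size=(1, 1, 1, clip.shape[-1]))
    return np.clip(clip + shift, 0.0, 1.0)

def build_quadruple(clip, rng):
    """Assumed layout: two positives (motion kept) and two negatives (motion changed)."""
    pos = clip                                # same motion, same appearance
    pos_app = disturb_appearance(clip, rng)   # same motion, changed appearance
    neg = disturb_motion(clip)                # changed motion, same appearance
    neg_app = disturb_appearance(neg, rng)    # changed motion, changed appearance
    return pos, pos_app, neg, neg_app

def info_nce(anchor, positives, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor (D,), positives (P, D), and negatives (N, D)."""
    def unit(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    a, p, n = unit(anchor), unit(positives), unit(negatives)
    pos_logits = p @ a / temperature
    neg_logits = n @ a / temperature
    # Numerically stable log-sum-exp over all candidates (positives and negatives).
    all_logits = np.concatenate([pos_logits, neg_logits])
    m = all_logits.max()
    log_denominator = m + np.log(np.exp(all_logits - m).sum())
    # Average the negative log-likelihood over the positives.
    return float(np.mean(log_denominator - pos_logits))
```

In this sketch the two negatives share the anchor's appearance statistics but not its motion, so an encoder can only separate them from the positives by attending to motion. In practice the embeddings passed to `info_nce` would come from a video encoder, which is omitted here.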