Contrastive learning applied to self-supervised representation learning has seen a resurgence in deep models. In this paper, we find that existing contrastive-learning-based solutions for self-supervised video recognition focus on inter-variance encoding but ignore the intra-variance among clips within the same video. We thus propose to learn dual representations for each clip which (\romannumeral 1) encode intra-variance through a shuffle-rank pretext task; (\romannumeral 2) encode inter-variance through a temporal coherent contrastive loss. Experimental results show that our method plays an essential role in balancing inter- and intra-variance and brings consistent performance gains on multiple backbones and contrastive learning frameworks. Integrated with SimCLR and pretrained on Kinetics-400, our method achieves $\textbf{82.0\%}$ and $\textbf{51.2\%}$ downstream classification accuracy on the UCF101 and HMDB51 test sets respectively, and $\textbf{46.1\%}$ video retrieval accuracy on UCF101, outperforming both pretext-task-based and contrastive-learning-based counterparts.