In low-level video analysis, effective representations are essential for establishing correspondences between video frames. Recent studies have learned such representations in a self-supervised fashion from unlabeled images or videos, using carefully designed pretext tasks. However, previous work concentrates on either spatially discriminative features or temporally repetitive features, paying little attention to the synergy between spatial and temporal cues. To address this issue, we propose a novel spatial-then-temporal self-supervised learning method: we first extract spatial features from unlabeled images via contrastive learning, and then enhance these features by exploiting the temporal cues in unlabeled videos via reconstructive learning. In the second step, we design a global correlation distillation loss to prevent the learned features from forgetting the spatial cues, and a local correlation distillation loss to combat the temporal discontinuity that harms the reconstruction. Experimental results on a series of correspondence-based video analysis tasks show that the proposed method outperforms state-of-the-art self-supervised methods. We also conduct ablation studies to verify the effectiveness of the two-step design and the distillation losses.
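To make the two-step idea concrete, below is a minimal sketch, not the authors' implementation, of how a correlation distillation term could preserve spatial cues during the temporal (reconstructive) step. It assumes the step-1 encoder, frozen as a teacher, supervises a trainable student fine-tuned on videos; the names `correlation` and `global_distill_loss` and the exact loss form are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of a global correlation distillation term (assumed design,
# not the paper's code): the frozen spatial teacher from step 1 constrains the
# student's frame-to-frame affinity while the student is fine-tuned on videos.
import torch
import torch.nn.functional as F

def correlation(feat_a, feat_b):
    """Dense affinity between two frames' feature maps.

    feat_a, feat_b: (B, C, H, W) -> (B, H*W, H*W) affinity of L2-normalized features."""
    b, c, h, w = feat_a.shape
    fa = F.normalize(feat_a.flatten(2), dim=1)   # (B, C, HW)
    fb = F.normalize(feat_b.flatten(2), dim=1)   # (B, C, HW)
    return torch.bmm(fa.transpose(1, 2), fb)     # (B, HW, HW)

def global_distill_loss(student_a, student_b, teacher_a, teacher_b):
    """Keep the student's global correlation close to the frozen teacher's,
    so temporal fine-tuning does not forget the spatial cues learned in step 1."""
    corr_s = correlation(student_a, student_b)
    with torch.no_grad():
        corr_t = correlation(teacher_a, teacher_b)
    return F.mse_loss(corr_s, corr_t)
```

In such a setup, this term would be added to the reconstruction objective of the second step; the local variant described in the abstract would analogously restrict the affinity to spatial neighborhoods around each query location.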