In low-level video analyses, effective representations are important for establishing correspondences between video frames. Recent studies have learned such representations in a self-supervised fashion from unlabeled images or videos, using carefully designed pretext tasks. However, previous work concentrates on either spatial-discriminative features or temporal-repetitive features, paying little attention to the synergy between spatial and temporal cues. To address this issue, we propose a spatial-then-temporal self-supervised learning method. Specifically, we first extract spatial features from unlabeled images via contrastive learning, and then enhance the features by exploiting the temporal cues in unlabeled videos via reconstructive learning. In the second step, we design a global correlation distillation loss to ensure that the learned features do not forget the spatial cues, and a local correlation distillation loss to combat the temporal discontinuity that harms the reconstruction. Experimental results on a series of correspondence-based video analysis tasks show that the proposed method outperforms state-of-the-art self-supervised methods. We also perform ablation studies to verify the effectiveness of the two-step design as well as the distillation losses.
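The abstract does not give the exact formulations of the two distillation losses. The following is a minimal, illustrative sketch (not the authors' implementation) of how a global and a local correlation distillation loss could be written in PyTorch, assuming a frozen image-pretrained teacher and a video-trained student that each produce per-frame feature maps; the function names, the MSE objective, and the neighborhood `radius` are assumptions for illustration.

```python
# Hypothetical sketch: distill frame-to-frame correlation maps from a frozen,
# image-pretrained "spatial" teacher into a "temporal" student so that
# reconstructive learning on videos does not erase the spatial cues.
import torch
import torch.nn.functional as F


def correlation(feat_a, feat_b):
    """Dense affinity between two frames' feature maps.

    feat_a, feat_b: (B, C, H, W) tensors -> (B, H*W, H*W) cosine affinities.
    """
    a = F.normalize(feat_a.flatten(2), dim=1)   # (B, C, HW)
    b = F.normalize(feat_b.flatten(2), dim=1)   # (B, C, HW)
    return torch.einsum('bck,bcl->bkl', a, b)


def global_distill_loss(stu_a, stu_b, tea_a, tea_b):
    """Match the student's global correlation to the frozen teacher's,
    so the spatially discriminative structure is retained."""
    return F.mse_loss(correlation(stu_a, stu_b),
                      correlation(tea_a, tea_b).detach())


def local_distill_loss(stu_a, stu_b, tea_a, tea_b, radius=4):
    """Same idea restricted to a local spatial neighborhood, where temporal
    discontinuities (e.g., occlusion, fast motion) hurt reconstruction most."""
    b, c, h, w = stu_a.shape
    corr_s = correlation(stu_a, stu_b).view(b, h * w, h, w)
    corr_t = correlation(tea_a, tea_b).view(b, h * w, h, w).detach()
    # Binary mask keeping, for each query position, only target positions
    # within `radius` pixels (Chebyshev distance).
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    q = torch.stack([ys.flatten(), xs.flatten()], dim=1).float()    # (HW, 2)
    k = torch.stack([ys, xs], dim=0).float()                        # (2, H, W)
    dist = (q[:, :, None, None] - k[None]).abs().max(dim=1).values  # (HW, H, W)
    mask = (dist <= radius).to(corr_s.dtype)[None]                  # (1, HW, H, W)
    return F.mse_loss(corr_s * mask, corr_t * mask)
```

In this sketch, the global term constrains the full affinity matrix between two frames, while the local term only constrains affinities inside a small window around each query pixel; both compare the student against a detached (frozen) teacher, which is one plausible way to realize the "do not forget the spatial cues" constraint described above.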