Learning temporal correspondence from unlabeled videos is of vital importance in computer vision, and has been tackled by various self-supervised pretext tasks. For such self-supervised learning, recent studies suggest using large-scale video datasets despite the high training cost. We propose a spatial-then-temporal pretext task to address this training-data cost. The task consists of two steps. First, we apply contrastive learning on unlabeled still-image data to obtain appearance-sensitive features. Then we switch to unlabeled video data and learn motion-sensitive features through frame reconstruction. In the second step, we propose a global correlation distillation loss to retain the appearance sensitivity learned in the first step, as well as a local correlation distillation loss in a pyramid structure to combat temporal discontinuity. Experimental results demonstrate that our method surpasses state-of-the-art self-supervised methods on a series of correspondence-based tasks. Ablation studies verify the effectiveness of the proposed two-step task and loss functions.
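To make the correlation distillation idea concrete, below is a minimal sketch of one plausible form of the global correlation distillation term: a frozen copy of the step-1 (appearance) encoder serves as a teacher, and the step-2 encoder is penalized when its frame-to-frame affinity maps drift away from the teacher's. All names, shapes, the temperature, and the KL formulation are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch (not the paper's implementation): a "global correlation
# distillation" term that encourages the step-2 encoder to keep the
# appearance-sensitive affinities of a frozen step-1 encoder.
# All names (teacher, student, tau) are hypothetical.
import torch
import torch.nn.functional as F

def global_correlation(feat_a, feat_b, tau=0.07):
    """Softmax-normalized affinity between all spatial positions of two frames.

    feat_*: (B, C, H, W) feature maps; returns (B, H*W, H*W) correlation maps.
    """
    a = F.normalize(feat_a.flatten(2), dim=1)      # (B, C, HW)
    b = F.normalize(feat_b.flatten(2), dim=1)      # (B, C, HW)
    corr = torch.bmm(a.transpose(1, 2), b) / tau   # (B, HW, HW)
    return corr.softmax(dim=-1)

def correlation_distillation_loss(student, teacher, frame1, frame2):
    """KL divergence between the student's and the frozen teacher's correlations."""
    with torch.no_grad():
        t_corr = global_correlation(teacher(frame1), teacher(frame2))
    s_corr = global_correlation(student(frame1), student(frame2))
    return F.kl_div(s_corr.clamp_min(1e-8).log(), t_corr, reduction="batchmean")
```

A local variant of the same idea would restrict each position's affinities to a spatial neighborhood and apply the loss at several feature resolutions, mirroring the pyramid structure mentioned above.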