We study unsupervised video representation learning, which seeks to learn both motion and appearance features from unlabeled video only; these features can then be reused for downstream tasks such as action recognition. This task is extremely challenging due to 1) the highly complex spatio-temporal information in videos and 2) the lack of labeled data for training. Unlike representation learning for static images, it is difficult to construct a suitable self-supervised task that models both motion and appearance features well. Recently, several attempts have been made to learn video representations through video playback speed prediction. However, it is non-trivial to obtain precise speed labels for videos. More critically, the learned models tend to focus on motion patterns and thus may not learn appearance features well. In this paper, we observe that relative playback speed is more consistent with motion patterns and thus provides more effective and stable supervision for representation learning. We therefore propose a new way to perceive playback speed by exploiting the relative speed between two video clips as the label. In this way, the model can perceive speed well and learn better motion features. Moreover, to ensure the learning of appearance features, we further propose an appearance-focused task in which the model must perceive the appearance difference between two video clips. We show that jointly optimizing the two tasks consistently improves performance on two downstream tasks, namely action recognition and video retrieval. Remarkably, for action recognition on the UCF101 dataset, we achieve 93.7% accuracy without using labeled data for pre-training, outperforming the ImageNet supervised pre-trained model. Code and pre-trained models can be found at https://github.com/PeihaoChen/RSPNet.
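To make the relative-speed pretext task concrete, the following is a minimal PyTorch sketch of one plausible formulation: sample two clips from the same video at different playback speeds and train a classifier to predict which clip plays faster. The clip sampler, encoder interface, and binary head here are illustrative assumptions, not the authors' exact RSPNet implementation (which the repository above documents).

```python
# A minimal sketch of a relative-speed pretext task, assuming a generic
# 3D-CNN encoder that maps a clip (B, T, C, H, W) to a feature vector.
# All names below are hypothetical placeholders.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_clip(video: torch.Tensor, speed: int, length: int = 16) -> torch.Tensor:
    """Sample `length` frames at a given playback speed.

    `video` has shape (T, C, H, W); speed k keeps every k-th frame, so the
    clip appears k-times faster when played back. Assumes the video has at
    least speed * length frames.
    """
    max_start = video.shape[0] - speed * length
    start = random.randint(0, max(max_start, 0))
    idx = torch.arange(start, start + speed * length, speed)
    return video[idx]  # (length, C, H, W)

class RelativeSpeedHead(nn.Module):
    """Predict which of two clips is played back faster (relative speed)."""
    def __init__(self, feat_dim: int):
        super().__init__()
        # Two classes: clip 1 is faster vs. clip 2 is faster.
        self.classifier = nn.Linear(2 * feat_dim, 2)

    def forward(self, f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
        return self.classifier(torch.cat([f1, f2], dim=-1))

def relative_speed_loss(encoder: nn.Module, head: RelativeSpeedHead,
                        video: torch.Tensor) -> torch.Tensor:
    """One training step of the relative-speed task on a single video."""
    s1, s2 = random.sample([1, 2, 4, 8], 2)       # two distinct speeds
    c1 = sample_clip(video, s1).unsqueeze(0)      # add batch dimension
    c2 = sample_clip(video, s2).unsqueeze(0)
    f1, f2 = encoder(c1), encoder(c2)
    label = torch.tensor([0 if s1 > s2 else 1])   # which clip is faster
    return F.cross_entropy(head(f1, f2), label)
```

Because the label depends only on which clip is faster, not on an absolute speed class, the supervision stays consistent even when the underlying motion statistics of different videos vary; the appearance-focused task described in the abstract would be trained jointly alongside this loss.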