We study unsupervised video representation learning, which seeks to learn both motion and appearance features from unlabeled video that can be reused for downstream tasks such as action recognition. This task is extremely challenging due to 1) the highly complex spatio-temporal information in videos and 2) the lack of labeled data for training. Unlike representation learning for static images, it is difficult to construct a suitable self-supervised task that models both motion and appearance features well. Recently, several attempts have been made to learn video representations through video playback speed prediction. However, it is non-trivial to obtain precise speed labels for videos. More critically, the learned models tend to focus on motion patterns and may therefore not learn appearance features well. In this paper, we observe that the relative playback speed is more consistent with the motion pattern and thus provides more effective and stable supervision for representation learning. We therefore propose a new way to perceive playback speed, exploiting the relative speed between two video clips as the label. This allows the model to perceive speed well and learn better motion features. Moreover, to ensure the learning of appearance features, we further propose an appearance-focused task in which the model is forced to perceive the appearance difference between two video clips. We show that jointly optimizing the two tasks consistently improves performance on two downstream tasks, namely action recognition and video retrieval. Remarkably, for action recognition on the UCF101 dataset, we achieve 93.7% accuracy without using labeled data for pre-training, outperforming the ImageNet supervised pre-trained model. Code and pre-trained models can be found at https://github.com/PeihaoChen/RSPNet.
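To make the two tasks concrete, below is a minimal PyTorch-style sketch of how the relative-speed and appearance objectives could be set up. This is an illustrative assumption based on the abstract, not the authors' released implementation: the speed set, the `sample_clip` helper, the loss forms (a triplet-style margin loss for relative speed and an InfoNCE-style loss for appearance), and all hyperparameters are hypothetical.

```python
# Hypothetical sketch of the two self-supervised tasks described above.
# Not the released RSPNet code: encoder, speed set, and loss margins are
# assumptions for illustration only.
import torch
import torch.nn.functional as F

SPEEDS = (1, 2, 4)  # assumed playback speeds, realised as frame strides


def sample_clip(video, speed, clip_len=16):
    """Sample a clip at a given playback speed by striding over frames.

    video: tensor of shape (T, C, H, W); returns (clip_len, C, H, W).
    """
    needed = clip_len * speed
    start = torch.randint(0, video.shape[0] - needed + 1, (1,)).item()
    return video[start : start + needed : speed]


def relative_speed_loss(f_anchor, f_pos, f_neg, margin=0.5):
    """Triplet-style loss for relative speed perception: a pair of clips
    with the SAME playback speed should be more similar than a pair with
    DIFFERENT speeds, so the model perceives speed relative to another
    clip rather than predicting an absolute speed label."""
    sim_pos = F.cosine_similarity(f_anchor, f_pos)  # same-speed pair
    sim_neg = F.cosine_similarity(f_anchor, f_neg)  # different-speed pair
    return F.relu(sim_neg - sim_pos + margin).mean()


def appearance_loss(f_query, f_key, f_others, temperature=0.07):
    """InfoNCE-style loss for the appearance-focused task: two clips from
    the same video form a positive pair, while clips from other videos
    are negatives, forcing the model to discriminate appearance."""
    f_query = F.normalize(f_query, dim=-1)
    pos = (f_query * F.normalize(f_key, dim=-1)).sum(-1, keepdim=True)
    neg = f_query @ F.normalize(f_others, dim=-1).t()
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(logits.shape[0], dtype=torch.long)  # positive at index 0
    return F.cross_entropy(logits, labels)
```

In this reading, the relative-speed loss only ever compares clips to one another, so the model never requires the kind of precise absolute speed label the abstract argues is hard to obtain, while the appearance loss supplies the complementary supervision that a purely motion-focused objective would miss.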


