We present a large-scale study on unsupervised spatiotemporal representation learning from videos. With a unified perspective on four recent image-based frameworks, we study a simple objective that can easily generalize all these methods to space-time. Our objective encourages temporally-persistent features in the same video, and in spite of its simplicity, it works surprisingly well across: (i) different unsupervised frameworks, (ii) pre-training datasets, (iii) downstream datasets, and (iv) backbone architectures. We draw a series of intriguing observations from this study, e.g., we discover that encouraging long-spanned persistency can be effective even if the timespan is 60 seconds. In addition to state-of-the-art results in multiple benchmarks, we report a few promising cases in which unsupervised pre-training can outperform its supervised counterpart. Code is made available at https://github.com/facebookresearch/SlowFast.
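To make the stated objective concrete, the following is a minimal sketch of one way a temporal-persistency loss could look, assuming an InfoNCE-style contrastive formulation in which two clips sampled from the same video are treated as positives and clips from other videos in the batch as negatives. It is not the authors' implementation; `encoder`, `clip_a`, `clip_b`, and `temperature` are illustrative names introduced here.

```python
# Sketch of a temporal-persistency objective (assumed InfoNCE-style variant),
# not the released SlowFast code.
import torch
import torch.nn.functional as F


def temporal_persistency_loss(encoder, clip_a, clip_b, temperature=0.1):
    """clip_a, clip_b: two clips drawn from the same videos,
    shaped (batch, channels, time, height, width)."""
    # Embed both clips and L2-normalize the features.
    z_a = F.normalize(encoder(clip_a), dim=1)   # (batch, dim)
    z_b = F.normalize(encoder(clip_b), dim=1)   # (batch, dim)

    # Similarity of every clip in view A to every clip in view B.
    logits = z_a @ z_b.t() / temperature        # (batch, batch)

    # The clip from the same video is the positive; clips from
    # other videos in the batch serve as negatives.
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)
```

The same idea extends to the non-contrastive frameworks covered in the study by replacing the cross-entropy term with the corresponding similarity-based loss while keeping the same-video positive pairing.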