The quality of the image representations obtained from self-supervised learning depends strongly on the type of data augmentations used in the learning formulation. Recent papers have ported these methods from still images to videos and found that leveraging both audio and video signals yields strong gains; however, they did not find that spatial augmentations such as cropping, which are very important for still images, work as well for videos. In this paper, we improve these formulations in two ways unique to the spatio-temporal aspect of videos. First, for space, we show that spatial augmentations such as cropping do work well for videos too, but that previous implementations, due to the high processing and memory cost, could not do this at a scale sufficient for it to work well. To address this issue, we introduce Feature Crop, a method to simulate such augmentations much more efficiently directly in feature space. Second, we show that, as opposed to naive average pooling, the use of transformer-based attention improves performance significantly, and is well suited for processing feature crops. Combining both of our discoveries into a new method, Space-Time Crop & Attend (STiCA), we achieve state-of-the-art performance across multiple video-representation learning benchmarks. In particular, we achieve new state-of-the-art accuracies of 67.0% on HMDB-51 and 93.1% on UCF-101 when pre-training on Kinetics-400.
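To make the two ideas named above concrete, the sketch below illustrates (1) a feature-space crop, i.e. slicing spatial windows directly out of a backbone's feature map so that many augmented views are obtained from a single forward pass, and (2) pooling a cropped feature map with transformer attention instead of a plain average. This is a minimal illustration, not the authors' implementation: the tensor shapes, the `feature_crop` and `AttentionPool` names, and all hyper-parameters are assumptions made for the example.

```python
# Minimal sketch (not the paper's code) of feature-space cropping and
# attention pooling, assuming a video backbone that outputs features of
# shape (B, C, T, H, W). All names and sizes here are illustrative.
import torch
import torch.nn as nn


def feature_crop(feat, crop_hw=4, num_crops=2):
    """Sample spatial windows from a feature map of shape (B, C, T, H, W).

    Each crop is just a tensor slice, so many augmented views can be drawn
    from one backbone forward pass instead of re-decoding and re-cropping
    the input video.
    """
    B, C, T, H, W = feat.shape
    crops = []
    for _ in range(num_crops):
        y = torch.randint(0, H - crop_hw + 1, (1,)).item()
        x = torch.randint(0, W - crop_hw + 1, (1,)).item()
        crops.append(feat[:, :, :, y:y + crop_hw, x:x + crop_hw])
    return crops  # list of (B, C, T, crop_hw, crop_hw) views


class AttentionPool(nn.Module):
    """Pool a feature crop with self-attention instead of averaging.

    The crop is flattened into a sequence of space-time tokens; a learnable
    [CLS]-style token attends over them and is read out as the clip embedding.
    """

    def __init__(self, dim=512, heads=8, layers=1):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=2 * dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, crop):
        B, C, T, H, W = crop.shape
        tokens = crop.flatten(2).transpose(1, 2)           # (B, T*H*W, C)
        tokens = torch.cat([self.cls.expand(B, -1, -1), tokens], dim=1)
        return self.encoder(tokens)[:, 0]                  # (B, C) embedding


if __name__ == "__main__":
    feat = torch.randn(2, 512, 4, 7, 7)   # stand-in for a video backbone output
    pool = AttentionPool(dim=512)
    embeddings = [pool(c) for c in feature_crop(feat)]
    print([e.shape for e in embeddings])  # two (2, 512) augmented-view embeddings
```

In a contrastive setup, embeddings of different feature crops of the same clip would serve as positive pairs; the point of cropping in feature space is that the extra views cost only a slice and a small attention head, not a second pass through the backbone.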