We propose Spatio-temporal Crop Aggregation for video representation LEarning (SCALE), a novel method that enjoys high scalability at both training and inference time. Our model builds long-range video features by learning from sets of video clip-level features extracted with a pre-trained backbone. To train the model, we propose a self-supervised objective consisting of masked clip feature prediction. We apply sparsity both to the input, by extracting a random set of video clips, and to the loss function, by reconstructing only the sparse inputs. Moreover, we use dimensionality reduction by working in the latent space of a pre-trained backbone applied to single video clips. These techniques make our method not only extremely efficient to train but also highly effective in transfer learning. We demonstrate that our video representation yields state-of-the-art performance with linear, non-linear, and KNN probing on common action classification and video understanding datasets.
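To make the objective concrete, below is a minimal sketch of masked clip-feature prediction over sparsely sampled clips, written in PyTorch. All names, module sizes, the transformer aggregator, the positional embedding of crop coordinates, and the mask ratio are illustrative assumptions rather than the authors' exact implementation; only the overall recipe (frozen clip backbone, random clip set, masking, loss restricted to masked clip features) follows the description above.

```python
# Hypothetical sketch of SCALE-style masked clip-feature prediction.
# `backbone` stands in for any pre-trained, frozen clip-level feature
# extractor; the aggregator and all hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedClipAggregator(nn.Module):
    def __init__(self, feat_dim=768, depth=4, heads=8):
        super().__init__()
        # Learnable token substituted for masked clip features.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, feat_dim))
        # Embeds each clip's (time, space) crop position (assumed encoding).
        self.pos_embed = nn.Linear(2, feat_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, clip_feats, coords, mask):
        # clip_feats: (B, N, D) features of N sparsely sampled clips
        # coords:     (B, N, 2) normalized spatio-temporal crop positions
        # mask:       (B, N) boolean, True where the clip feature is hidden
        x = torch.where(mask.unsqueeze(-1),
                        self.mask_token.expand_as(clip_feats),
                        clip_feats)
        x = x + self.pos_embed(coords)
        return self.encoder(x)

def masked_prediction_loss(model, backbone, clips, coords, mask_ratio=0.5):
    """Predict masked clip features; loss only on masked positions."""
    B, N = clips.shape[:2]
    with torch.no_grad():  # backbone is pre-trained and kept frozen
        feats = backbone(clips.flatten(0, 1)).view(B, N, -1)
    mask = torch.rand(B, N, device=clips.device) < mask_ratio
    pred = model(feats, coords, mask)
    # Sparse loss: only the masked clip features are reconstructed,
    # so cost scales with the small set of sampled clips, not the video.
    return F.mse_loss(pred[mask], feats[mask])
```

Note the two sources of efficiency the abstract highlights: the model never sees raw long videos, only a random set of clip-level latent vectors (dimensionality reduction via the frozen backbone), and the reconstruction target is restricted to those same sparse inputs.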