We propose Spatio-temporal Crop Aggregation for video representation LEarning (SCALE), a novel method that offers high scalability at both training and inference time. Our model builds long-range video features by learning from sets of video clip-level features extracted with a pre-trained backbone. To train the model, we propose a self-supervised objective consisting of masked clip feature prediction. We apply sparsity both to the input, by extracting a random set of video clips, and to the loss function, by reconstructing only the sparse inputs. Moreover, we use dimensionality reduction by working in the latent space of a pre-trained backbone applied to single video clips. The video representation is then obtained by ensembling the concatenation of the embeddings of the separate video clips with a video clip set summarization token. These techniques make our method not only extremely efficient to train, but also highly effective in transfer learning. We demonstrate that our video representation yields state-of-the-art performance with linear, non-linear, and $k$-NN probing on common action classification datasets.
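To make the training objective more concrete, the following is a minimal sketch, assuming a PyTorch implementation, of a masked clip feature prediction loss with sparse reconstruction as described above. The aggregator module, mask token, mask ratio, and feature dimension are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch (illustrative, not the authors' implementation) of masked
# clip feature prediction: clip-level features from a frozen pre-trained
# backbone are partially masked, and a lightweight aggregator is trained to
# reconstruct only the masked, sparsely sampled clip features.
import torch
import torch.nn as nn


def masked_clip_feature_loss(clip_feats, aggregator, mask_token, mask_ratio=0.5):
    """clip_feats: (B, N, D) features of N sparsely sampled clips per video,
    produced offline by a frozen pre-trained backbone."""
    B, N, D = clip_feats.shape
    # Randomly select clip positions to mask for each video in the batch.
    mask = torch.rand(B, N, device=clip_feats.device) < mask_ratio  # (B, N) bool
    # Replace masked clip features with a learnable mask token.
    inputs = torch.where(mask.unsqueeze(-1), mask_token.expand(B, N, D), clip_feats)
    # The aggregator (e.g. a small transformer) predicts features for all clips.
    preds = aggregator(inputs)  # (B, N, D)
    # Sparse loss: reconstruct only the masked positions of the sampled clips.
    return ((preds - clip_feats) ** 2)[mask].mean()


# Example usage with a tiny transformer encoder as the (hypothetical) aggregator.
aggregator = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True),
    num_layers=2,
)
mask_token = nn.Parameter(torch.zeros(768))
loss = masked_clip_feature_loss(torch.randn(4, 8, 768), aggregator, mask_token)
```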