We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification. Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers. In order to handle the long sequences of tokens encountered in video, we propose several efficient variants of our model which factorise the spatial and temporal dimensions of the input. Although transformer-based models are known to be effective only when large training datasets are available, we show how we can effectively regularise the model during training and leverage pretrained image models to train on comparatively small datasets. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple video classification benchmarks including Kinetics 400 and 600, Epic Kitchens, Something-Something v2 and Moments in Time, outperforming prior methods based on deep 3D convolutional networks. To facilitate further research, we release code at https://github.com/google-research/scenic/tree/main/scenic/projects/vivit
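As a rough illustration of the spatio-temporal token extraction described above, the sketch below shows how non-overlapping "tubelets" of a video can be flattened and linearly projected into a token sequence for a transformer encoder. This is not the authors' implementation: the tubelet size, embedding dimension, and the random projection weights are hypothetical placeholders chosen purely for illustration (the released Scenic/JAX code should be consulted for the actual model).

```python
# Minimal sketch (assumed, not the authors' code) of spatio-temporal
# tubelet tokenisation for a video transformer, written with JAX.
import jax
import jax.numpy as jnp


def extract_tubelet_tokens(video, tubelet=(2, 16, 16), embed_dim=768, key=None):
    """Split a video into non-overlapping tubelets and project each to a token.

    video: array of shape (T, H, W, C); T, H, W must be divisible by the
    corresponding tubelet dimensions. Returns (num_tokens, embed_dim).
    """
    T, H, W, C = video.shape
    t, h, w = tubelet
    # Reshape into (T//t, t, H//h, h, W//w, w, C), then group the tubelet
    # axes together and flatten each tubelet into a single vector.
    x = video.reshape(T // t, t, H // h, h, W // w, w, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6).reshape(-1, t * h * w * C)
    # Linear projection to the transformer embedding dimension.
    # The weights are random here purely for illustration; in practice they
    # are learned (and can be initialised from a pretrained image model).
    key = key if key is not None else jax.random.PRNGKey(0)
    proj = jax.random.normal(key, (t * h * w * C, embed_dim)) * 0.02
    return x @ proj


# Example: an 8-frame 224x224 RGB clip yields (8/2) * (224/16)**2 = 784 tokens,
# which would then be fed (plus positional embeddings) to the transformer layers.
tokens = extract_tubelet_tokens(jnp.zeros((8, 224, 224, 3)))
print(tokens.shape)  # (784, 768)
```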