ViViT: A Video Vision Transformer

We present pure-transformer-based models for video classification, drawing upon the recent success of such models in image classification. Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers. To handle the long sequences of tokens encountered in video, we propose several efficient variants of our model that factorise the spatial and temporal dimensions of the input. Although transformer-based models are known to be effective only when large training datasets are available, we show how to regularise the model effectively during training and leverage pretrained image models so that we can train on comparatively small datasets. We conduct thorough ablation studies and achieve state-of-the-art results on multiple video classification benchmarks, including Kinetics 400 and 600, Epic Kitchens, Something-Something v2 and Moments in Time, outperforming prior methods based on deep 3D convolutional networks. To facilitate further research, we will release code and models.
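To make the idea of spatio-temporal tokens and the motivation for factorisation more concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes a video array of shape (T, H, W, C) and cuts it into non-overlapping blocks ("tubelets" here), each flattened into one token. The function names, the tubelet sizes and the particular way of counting attention pairs are illustrative assumptions. The abstract only states that tokens are extracted and that the spatial and temporal dimensions are factorised.

```python
import numpy as np


def extract_tubelet_tokens(video, t, h, w):
    """Split a (T, H, W, C) video into non-overlapping t x h x w blocks.

    Returns an array of shape (nt, ns, t*h*w*C), where nt is the number of
    temporal positions and ns the number of spatial positions per step.
    Hypothetical helper: a real model would also apply a learned linear
    projection and add positional embeddings.
    """
    T, H, W, C = video.shape
    if T % t or H % h or W % w:
        raise ValueError("video dimensions must be divisible by tubelet size")
    nt, nh, nw = T // t, H // h, W // w
    x = video.reshape(nt, t, nh, h, nw, w, C)
    # Bring the block indices to the front, the within-block axes to the back.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(nt, nh * nw, t * h * w * C)


def attention_pairs(nt, ns):
    """Count query-key pairs for joint vs. one possible factorised attention.

    Joint: every token attends to every token.
    Factorised (an assumption): attend over space within each temporal
    position, then over time at each spatial position.
    """
    joint = (nt * ns) ** 2
    factorised = nt * ns ** 2 + ns * nt ** 2
    return joint, factorised


rng = np.random.default_rng(0)
video = rng.standard_normal((8, 32, 32, 3))  # toy clip: 8 frames of 32x32 RGB
tokens = extract_tubelet_tokens(video, t=2, h=16, w=16)
nt, ns, dim = tokens.shape
joint, factorised = attention_pairs(nt, ns)
print(tokens.shape, joint, factorised)
```

Even on this toy clip the factorised count is half the joint count. The gap widens as the number of tokens grows, because the joint cost is quadratic in the total sequence length.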

