This paper presents VTN, a transformer-based framework for video recognition. Inspired by recent developments in vision transformers, we ditch the standard approach in video action recognition that relies on 3D ConvNets and introduce a method that classifies actions by attending to the entire video sequence information. Our approach is generic and builds on top of any given 2D spatial network. In terms of wall runtime, it trains $16.1\times$ faster and runs $5.1\times$ faster during inference while maintaining competitive accuracy compared to other state-of-the-art methods. It enables whole video analysis, via a single end-to-end pass, while requiring $1.5\times$ fewer GFLOPs. We report competitive results on Kinetics-400 and present an ablation study of VTN properties and the trade-off between accuracy and inference speed. We hope our approach will serve as a new baseline and start a fresh line of research in the video recognition domain. Code and models will be available soon.
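To make the described pipeline concrete, here is a minimal, hypothetical sketch of a VTN-style model: a 2D spatial backbone applied per frame, a temporal transformer attending over the whole frame sequence, and a classification head. The ResNet-50 backbone, the vanilla `nn.TransformerEncoder` (used here as a stand-in for the paper's temporal attention module), and all hyperparameters are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of a VTN-style classifier (not the authors' code).
# Pipeline: per-frame 2D backbone -> temporal attention over all frames -> head.
import torch
import torch.nn as nn
from torchvision.models import resnet50


class VTNStyleClassifier(nn.Module):
    def __init__(self, num_classes=400, embed_dim=2048, depth=3, heads=8):
        super().__init__()
        backbone = resnet50(weights=None)
        # Drop the final FC layer; keep the pooled 2048-d feature per frame.
        self.spatial = nn.Sequential(*list(backbone.children())[:-1])
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=heads, batch_first=True)
        # Stand-in temporal attention module (assumption: a plain encoder).
        self.temporal = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, video):
        # video: (batch, frames, 3, H, W)
        b, t = video.shape[:2]
        frames = video.flatten(0, 1)             # (b*t, 3, H, W)
        feats = self.spatial(frames).flatten(1)  # (b*t, embed_dim)
        feats = feats.view(b, t, -1)             # (b, t, embed_dim)
        # Prepend a classification token and attend over the entire sequence.
        tokens = torch.cat([self.cls_token.expand(b, -1, -1), feats], dim=1)
        out = self.temporal(tokens)
        return self.head(out[:, 0])              # logits from the [CLS] token


if __name__ == "__main__":
    model = VTNStyleClassifier(num_classes=400)  # e.g. Kinetics-400
    clip = torch.randn(2, 16, 3, 224, 224)       # 2 videos, 16 frames each
    print(model(clip).shape)                      # torch.Size([2, 400])
```

Because the spatial backbone is a generic 2D network, it can in principle be swapped for any per-frame feature extractor, which is the property the abstract highlights.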