We present a convolution-free approach to video classification built exclusively on self-attention over space and time. Our method, named "TimeSformer," adapts the standard Transformer architecture to video by enabling spatiotemporal feature learning directly from a sequence of frame-level patches. Our experimental study compares different self-attention schemes and suggests that "divided attention," where temporal attention and spatial attention are separately applied within each block, leads to the best video classification accuracy among the design choices considered. Despite the radically new design, TimeSformer achieves state-of-the-art results on several action recognition benchmarks, including the best reported accuracy on Kinetics-400 and Kinetics-600. Finally, compared to 3D convolutional networks, our model is faster to train, it can achieve dramatically higher test efficiency (at a small drop in accuracy), and it can also be applied to much longer video clips (over one minute long). Code and models are available at: https://github.com/facebookresearch/TimeSformer.
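To make the divided-attention scheme concrete, below is a minimal PyTorch sketch of one block that applies temporal attention and then spatial attention, each with its own residual connection. The input shape (batch, frames, patches, dim), the class name DividedAttentionBlock, and the choice of hyperparameters are illustrative assumptions; the sketch omits the classification token and the MLP sub-layer, so it conveys the idea rather than reproducing the released implementation in the repository linked above.

```python
# Minimal sketch of divided space-time attention (temporal, then spatial),
# under the assumptions stated above; not the official TimeSformer code.
import torch
import torch.nn as nn


class DividedAttentionBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.attn_s = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, D) -- B clips, T frames, N patches per frame, D channels.
        B, T, N, D = x.shape

        # Temporal attention: each patch location attends across the T frames.
        xt = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
        yt = self.norm_t(xt)
        yt, _ = self.attn_t(yt, yt, yt)
        xt = xt + yt  # residual connection
        x = xt.reshape(B, N, T, D).permute(0, 2, 1, 3)

        # Spatial attention: the N patches within each frame attend to one another.
        xs = x.reshape(B * T, N, D)
        ys = self.norm_s(xs)
        ys, _ = self.attn_s(ys, ys, ys)
        xs = xs + ys  # residual connection
        return xs.reshape(B, T, N, D)


# Example: 2 clips, 8 frames, 14x14 = 196 patches, 768-dim patch embeddings.
block = DividedAttentionBlock(dim=768, num_heads=12)
out = block(torch.randn(2, 8, 196, 768))
print(out.shape)  # torch.Size([2, 8, 196, 768])
```

Compared with joint space-time attention over all T x N tokens, the divided scheme runs two attentions over much shorter sequences (length T and length N), which is what makes it cheaper per block while still mixing information across both dimensions.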