Transformer models have shown great success in handling long-range interactions, making them a promising tool for modeling video. However, they lack inductive biases and scale quadratically with input length. These limitations are further exacerbated by the high dimensionality introduced by the temporal dimension. While there are surveys analyzing the advances of Transformers for vision, none focus on an in-depth analysis of video-specific designs. In this survey, we analyze the main contributions and trends of works leveraging Transformers to model video. Specifically, we first delve into how videos are handled at the input level. Then, we study the architectural changes made to deal with video more efficiently, reduce redundancy, re-introduce useful inductive biases, and capture long-term temporal dynamics. In addition, we provide an overview of different training regimes and explore effective self-supervised learning strategies for video. Finally, we conduct a performance comparison on the most common benchmark for Video Transformers (i.e., action classification), finding them to outperform 3D ConvNets even at lower computational complexity.