Transformer models have shown great success in handling long-range interactions, making them a promising tool for modeling video. However, they lack inductive biases and scale quadratically with input length. These limitations are further exacerbated when dealing with the high dimensionality introduced by the temporal dimension. While there are surveys analyzing the advances of Transformers for vision, none focus on an in-depth analysis of video-specific designs. In this survey, we analyze the main contributions and trends of works leveraging Transformers to model video. Specifically, we first delve into how videos are handled at the input level. Then, we study the architectural changes made to deal with video more efficiently, reduce redundancy, re-introduce useful inductive biases, and capture long-term temporal dynamics. In addition, we provide an overview of different training regimes and explore effective self-supervised learning strategies for video. Finally, we conduct a performance comparison on the most common benchmark for Video Transformers (i.e., action classification), finding them to outperform 3D ConvNets even with less computational complexity.