A video transformer naturally incurs a heavier computation burden than a static vision transformer, since the former processes a sequence $T$ times longer than the latter under the current attention of quadratic complexity $(T^2N^2)$. Existing works treat the temporal axis as a simple extension of the spatial axes, focusing on shortening the spatio-temporal sequence by either generic pooling or local windowing, without exploiting temporal redundancy. However, videos naturally contain redundant information between neighboring frames; hence, we can potentially suppress attention on visually similar frames in a dilated manner. Based on this hypothesis, we propose LAPS, a long-term ``\textbf{\textit{Leap Attention}}'' (LA) and short-term ``\textbf{\textit{Periodic Shift}}'' (\textit{P}-Shift) module for video transformers, with $(2TN^2)$ complexity. Specifically, LA groups long-term frames into pairs, then refactors each discrete pair via attention. The \textit{P}-Shift exchanges features between temporal neighbors to compensate for the loss of short-term dynamics. By replacing the vanilla 2D attention with LAPS, we can adapt a static transformer into a video one with zero extra parameters and negligible computation overhead ($\sim$2.6\%). Experiments on the standard Kinetics-400 benchmark demonstrate that our LAPS transformer achieves competitive performance in terms of accuracy, FLOPs, and Params among CNN and transformer SOTAs. We open-source our project at \sloppy \href{https://github.com/VideoNetworks/LAPS-transformer}{\textit{\color{magenta}{https://github.com/VideoNetworks/LAPS-transformer}}} .
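To make the two mechanisms concrete, the following is a minimal PyTorch sketch written under our own assumptions about the token layout $(B, T, N, C)$; the function names \texttt{periodic\_shift} and \texttt{leap\_attention}, the channel fold ratio, and the pairing of frame $t$ with frame $t+T/2$ are illustrative choices, not the authors' released implementation. Note that pairing distant frames yields $T/2$ attention maps of size $(2N)^2$, i.e. the $(2TN^2)$ cost stated above.

\begin{verbatim}
import torch

def periodic_shift(x, fold_div=8):
    """Short-term P-Shift sketch: exchange a channel slice with
    temporal neighbors. x: (B, T, N, C) video tokens."""
    B, T, N, C = x.shape
    fold = C // fold_div
    out = x.clone()
    out[:, 1:, :, :fold] = x[:, :-1, :, :fold]            # shift forward in time
    out[:, :-1, :, fold:2*fold] = x[:, 1:, :, fold:2*fold]  # shift backward in time
    return out

def leap_attention(q, k, v):
    """Long-term LA sketch: pair frame t with frame t + T/2 and attend
    only within each pair -> (2T N^2) cost. q, k, v: (B, T, N, C), T even."""
    B, T, N, C = q.shape
    pair = lambda t: torch.cat([t[:, :T // 2], t[:, T // 2:]], dim=2)  # (B, T/2, 2N, C)
    qp, kp, vp = pair(q), pair(k), pair(v)
    attn = torch.softmax(qp @ kp.transpose(-2, -1) / C ** 0.5, dim=-1)
    out = attn @ vp                                        # (B, T/2, 2N, C)
    return torch.cat([out[:, :, :N], out[:, :, N:]], dim=1)  # back to (B, T, N, C)
\end{verbatim}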