Motion, as a unique property of video, has been critical to the development of video understanding models. Modern deep learning models leverage motion by executing spatio-temporal 3D convolutions, factorizing 3D convolutions into separate spatial and temporal convolutions, or computing self-attention along the temporal dimension. The implicit assumption behind these successes is that feature maps across consecutive frames can be nicely aggregated. Nevertheless, this assumption may not always hold, especially for regions with large deformation. In this paper, we present a new recipe for an inter-frame attention block, namely Stand-alone Inter-Frame Attention (SIFA), which delves into the deformation across frames to estimate local self-attention at each spatial location. Technically, SIFA remoulds the deformable design by re-scaling the offset predictions with the difference between the two frames. Taking each spatial location in the current frame as the query, the locally deformable neighbors in the next frame are regarded as the keys/values. SIFA then measures the similarity between the query and the keys as stand-alone attention, which is used to compute a weighted average of the values for temporal aggregation. We further plug the SIFA block into ConvNets and Vision Transformers, respectively, to devise SIFA-Net and SIFA-Transformer. Extensive experiments on four video datasets demonstrate the superiority of SIFA-Net and SIFA-Transformer as stronger backbones. More remarkably, SIFA-Transformer achieves an accuracy of 83.1% on the Kinetics-400 dataset. Source code is available at \url{https://github.com/FuchenUSTC/SIFA}.
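The query/key/value scheme described above can be sketched as follows. This is a minimal numpy illustration of stand-alone inter-frame attention with a fixed local window: the deformable offset prediction and its re-scaling by the frame difference, which are the core of the actual SIFA block, are omitted for clarity, so this is not the authors' implementation.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def inter_frame_attention(query_feat, next_feat, window=3):
    """Simplified stand-alone inter-frame attention (illustrative sketch).

    query_feat: (H, W, C) feature map of the current frame (queries).
    next_feat:  (H, W, C) feature map of the next frame; for each location,
                a window x window neighborhood here serves as keys/values.
    The real SIFA block samples *deformable* neighbors via predicted,
    frame-difference-re-scaled offsets; here the neighborhood is fixed.
    """
    H, W, C = query_feat.shape
    r = window // 2
    # zero-pad so every location has a full neighborhood in the next frame
    padded = np.pad(next_feat, ((r, r), (r, r), (0, 0)))
    out = np.zeros_like(query_feat)
    for i in range(H):
        for j in range(W):
            q = query_feat[i, j]                              # query (C,)
            keys = padded[i:i + window, j:j + window].reshape(-1, C)
            attn = softmax(keys @ q / np.sqrt(C))             # stand-alone attention
            out[i, j] = attn @ keys                           # weighted average of values
    return out
```

In the full model, this aggregated feature is fused back into the current frame's representation, so temporal information is injected per spatial location rather than by uniform temporal pooling.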