Temporal modeling is crucial for various video learning tasks. Most recent approaches employ either factorized (2D+1D) or joint (3D) spatial-temporal operations to extract temporal context from the input frames. While the former is more efficient in computation, the latter often obtains better performance. In this paper, we attribute this gap to a dilemma between the sufficiency and the efficiency of interactions among various positions in different frames. These interactions affect the extraction of task-relevant information shared among frames. To resolve this issue, we prove that frame-by-frame alignments have the potential to increase the mutual information between frame representations, thereby capturing more task-relevant information and boosting effectiveness. We then propose Alignment-guided Temporal Attention (ATA), which extends 1-dimensional temporal attention with parameter-free patch-level alignments between neighboring frames. It can act as a general plug-in for image backbones to perform action recognition without any model-specific design. Extensive experiments on multiple benchmarks demonstrate the superiority and generality of our module.
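To make the core idea concrete, below is a minimal sketch of alignment-guided temporal attention, not the authors' released implementation. It assumes patch features of shape (B, T, N, C) (batch, frames, patches per frame, channels), uses parameter-free cosine-similarity matching between neighboring frames as the alignment step, and reuses torch.nn.MultiheadAttention for the 1D temporal attention; the tensor layout and module names are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


def align_neighboring_frames(x):
    # x: (B, T, N, C) patch features.
    # For every patch in frame t, pick the most similar patch in frame t-1
    # (parameter-free cosine-similarity matching); frame 0 keeps itself.
    B, T, N, C = x.shape
    xn = F.normalize(x, dim=-1)
    aligned = [x[:, 0]]
    for t in range(1, T):
        # similarity between patches of frame t and frame t-1: (B, N, N)
        sim = torch.einsum("bnc,bmc->bnm", xn[:, t], xn[:, t - 1])
        idx = sim.argmax(dim=-1)  # (B, N) best-matching patch index in frame t-1
        prev = torch.gather(x[:, t - 1], 1, idx.unsqueeze(-1).expand(-1, -1, C))
        aligned.append(prev)
    return torch.stack(aligned, dim=1)  # (B, T, N, C), aligned predecessor features


class AlignmentGuidedTemporalAttention(nn.Module):
    # 1D temporal attention run along alignment-corrected patch tracks.
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, N, C)
        B, T, N, C = x.shape
        x_aligned = align_neighboring_frames(x)
        # fold patches into the batch so attention mixes information over T only
        q = x.permute(0, 2, 1, 3).reshape(B * N, T, C)
        kv = x_aligned.permute(0, 2, 1, 3).reshape(B * N, T, C)
        out, _ = self.attn(q, kv, kv)
        return out.reshape(B, N, T, C).permute(0, 2, 1, 3)


if __name__ == "__main__":
    feats = torch.randn(2, 8, 196, 768)       # 2 clips, 8 frames, 14x14 patches
    ata = AlignmentGuidedTemporalAttention(768)
    print(ata(feats).shape)                   # torch.Size([2, 8, 196, 768])

In this sketch the module is parameter-free apart from the attention projections, so it can be dropped between the spatial blocks of an image backbone (e.g. a ViT) without any model-specific design, mirroring the plug-in usage described in the abstract.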