Attention mechanisms have significantly boosted the performance of video classification neural networks by exploiting feature contexts. However, current research on video attention generally adopts a single kind of context (e.g., channel, spatial/temporal, or global context) to refine the features and neglects the underlying correlation between contexts when computing attention. This leads to incomplete context utilization and hence to limited performance improvement. To tackle the problem, this paper proposes an efficient attention-in-attention (AIA) method for element-wise feature refinement, which investigates the feasibility of inserting the channel context into the spatio-temporal attention learning module, referred to as CinST, as well as the reverse variant, referred to as STinC. Specifically, we instantiate the video feature contexts as dynamics aggregated along a specific axis with global average and max pooling operations. In an AIA module, the first attention block uses one kind of context information to guide the calculation of the gating weights in the second attention block, which targets the other context. Moreover, all computational operations in the attention units act on the pooled dimension, which results in a very small increase in computational cost ($<$0.02\%). To verify our method, we densely integrate it into two classical video network backbones and conduct extensive experiments on several standard video classification benchmarks. The source code of our AIA is available at \url{https://github.com/haoyanbin918/Attention-in-Attention}.
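To make the CinST workflow described above more concrete, the following is a loose NumPy sketch and not the authors' implementation (the linked repository is the reference). It assumes a feature tensor of shape (C, T, H, W). The function name `cinst_sketch` and the scalar mixing weights `w_c` and `w_st` are hypothetical stand-ins for the learned layers of a real attention unit. The sketch illustrates only the data flow, in which channel gates computed from pooled context guide the spatio-temporal gates, and it does not attempt to reproduce the reported cost profile.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def cinst_sketch(x, w_c=(1.0, 1.0), w_st=(1.0, 1.0)):
    """Loose CinST-style sketch for features x of shape (C, T, H, W).

    w_c and w_st are hypothetical scalar mixing weights; a trained model
    would use learned layers here instead.
    """
    # Channel context: pool away the spatio-temporal axes -> two (C,) vectors.
    c_avg = x.mean(axis=(1, 2, 3))
    c_max = x.max(axis=(1, 2, 3))
    # First attention: channel gates in (0, 1), shape (C,).
    c_gate = sigmoid(w_c[0] * c_avg + w_c[1] * c_max)

    # Inject the channel gates before pooling along the channel axis, so the
    # spatio-temporal context is computed from channel-reweighted features.
    guided = x * c_gate[:, None, None, None]
    st_avg = guided.mean(axis=0)
    st_max = guided.max(axis=0)
    # Second attention: spatio-temporal gates in (0, 1), shape (T, H, W).
    st_gate = sigmoid(w_st[0] * st_avg + w_st[1] * st_max)

    # Element-wise refinement: broadcast the gate over the channel axis.
    return x * st_gate[None], c_gate, st_gate


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 7, 7))  # toy clip: 8 channels, 4 frames, 7x7
y, c_gate, st_gate = cinst_sketch(x)
print(y.shape, c_gate.shape, st_gate.shape)
```

Under the same assumptions, an STinC-style variant would swap the two stages: the spatio-temporal gates would be computed first and applied before pooling over (T, H, W) to obtain the channel gates.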