Self-attention learns pairwise interactions to model long-range dependencies, yielding great improvements for video action recognition. In this paper, we seek a deeper understanding of self-attention for temporal modeling in videos. We first demonstrate that the entangled modeling of spatio-temporal information by flattening all pixels is sub-optimal, failing to capture the temporal relationships among frames explicitly. To this end, we introduce Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner. We apply GTA on both pixels and semantically similar regions to capture temporal relationships at different levels of spatial granularity. Unlike conventional self-attention that computes an instance-specific attention matrix, GTA directly learns a global attention matrix that is intended to encode temporal structures that generalize across different samples. We further augment GTA in a cross-channel multi-head fashion to exploit channel interactions for better temporal modeling. Extensive experiments on 2D and 3D networks demonstrate that our approach consistently enhances temporal modeling and achieves state-of-the-art performance on three video action recognition datasets.
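To illustrate the decoupled design described above, the following is a minimal PyTorch sketch of a globally learned temporal attention layer: the T x T attention matrix is a learnable parameter shared across all samples rather than computed from queries and keys, and it is split across channel groups in a multi-head fashion. Tensor shapes, the module name, and the residual/projection layout are our own simplifying assumptions for exposition, not the exact implementation.

```python
import torch
import torch.nn as nn


class GlobalTemporalAttention(nn.Module):
    """Sketch of a globally learned, channel-grouped temporal attention.

    The attention matrix is a learnable (heads, T, T) parameter shared by
    all samples, so it encodes temporal structure that is not instance-specific.
    Spatial attention is assumed to be handled by a separate, decoupled module.
    """

    def __init__(self, num_frames: int, channels: int, num_heads: int = 4):
        super().__init__()
        assert channels % num_heads == 0
        self.num_heads = num_heads
        # One learnable T x T temporal attention map per head; each head
        # operates on its own channel group (cross-channel multi-head).
        self.attn = nn.Parameter(torch.zeros(num_heads, num_frames, num_frames))
        self.proj = nn.Linear(channels, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, C) -- batch, frames, spatial positions (pixels or
        # pooled regions), channels.
        B, T, N, C = x.shape
        head_dim = C // self.num_heads
        # Split channels into heads: (B, H, N, T, head_dim).
        v = x.view(B, T, N, self.num_heads, head_dim).permute(0, 3, 2, 1, 4)
        # Normalize the shared attention matrix over the source-frame axis.
        w = self.attn.softmax(dim=-1)  # (H, T, T)
        # Mix information across frames only; spatial positions are untouched.
        out = torch.einsum('hts,bhnsc->bhntc', w, v)
        out = out.permute(0, 3, 2, 1, 4).reshape(B, T, N, C)
        return x + self.proj(out)  # residual connection


if __name__ == "__main__":
    # Example: 8 frames, 49 spatial positions (7x7 feature map), 64 channels.
    layer = GlobalTemporalAttention(num_frames=8, channels=64, num_heads=4)
    y = layer(torch.randn(2, 8, 49, 64))
    print(y.shape)  # torch.Size([2, 8, 49, 64])
```

Because the temporal attention weights are parameters rather than functions of the input, they are learned from the whole training set and applied identically to every clip, which is the key distinction from conventional self-attention drawn in the abstract.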