First-person action recognition is a challenging task in video understanding. Because of strong ego-motion and a limited field of view, many background or noisy frames in a first-person video can distract an action recognition model during its learning process. To encode more discriminative features, the model must be able to focus on the most relevant parts of the video for action recognition. Previous works attempted to address this problem with temporal attention, but failed to consider the global context of the full video, which is critical for determining the relatively significant parts. In this work, we propose a simple yet effective Stacked Temporal Attention Module (STAM) that computes temporal attention based on global knowledge across clips to emphasize the most discriminative features. We achieve this by stacking multiple self-attention layers. Instead of naive stacking, which we show experimentally to be ineffective, we carefully design the input to each self-attention layer so that both the local and global context of the video are considered when generating the temporal attention weights. Experiments demonstrate that the proposed STAM can be built on top of most existing backbones and boosts performance on various datasets.
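The core idea, computing temporal attention over clip-level features with stacked self-attention layers whose inputs combine local and global context, can be sketched roughly as follows. This is a minimal PyTorch illustration only: the layer count, the mean-pooled global context, and the softmax re-weighting are assumptions for exposition, not the paper's exact design.

```python
# Minimal sketch of stacked temporal self-attention over clip-level features.
# Shapes, layer counts, and the residual/global-pooling input design are
# illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class StackedTemporalAttention(nn.Module):
    def __init__(self, dim: int, num_layers: int = 2, num_heads: int = 4):
        super().__init__()
        # One self-attention layer per stack level, applied across the clip axis.
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_layers)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(num_layers))
        # Final projection to a scalar temporal weight per clip.
        self.to_weight = nn.Linear(dim, 1)

    def forward(self, clip_feats: torch.Tensor):
        # clip_feats: (batch, num_clips, dim) features from a backbone.
        x = clip_feats
        for attn, norm in zip(self.layers, self.norms):
            # Mix global (video-level) context into each layer's input by adding
            # the mean over clips -- an assumed stand-in for the paper's
            # carefully designed per-layer inputs.
            global_ctx = x.mean(dim=1, keepdim=True).expand_as(x)
            query = norm(x + global_ctx)
            out, _ = attn(query, query, query, need_weights=False)
            x = x + out  # residual connection preserves local clip information
        # Temporal attention weights over clips, normalized with softmax.
        weights = torch.softmax(self.to_weight(x).squeeze(-1), dim=1)
        # Re-weight the original clip features and aggregate into a video descriptor.
        video_feat = (weights.unsqueeze(-1) * clip_feats).sum(dim=1)
        return video_feat, weights


if __name__ == "__main__":
    feats = torch.randn(2, 8, 256)          # 2 videos, 8 clips, 256-d features
    stam = StackedTemporalAttention(dim=256)
    video_feat, weights = stam(feats)
    print(video_feat.shape, weights.shape)  # (2, 256) and (2, 8)
```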