The introduction of the Transformer model has led to tremendous advances in sequence modeling, especially in the text domain. However, the use of attention-based models for video understanding is still relatively unexplored. In this paper, we introduce the Gated Adversarial Transformer (GAT) to enhance the applicability of attention-based models to videos. GAT uses a multi-level attention gate to model the relevance of a frame based on local and global contexts. This enables the model to understand the video at various granularities. Further, GAT uses adversarial training to improve model generalization. We propose a temporal attention regularization scheme to improve the robustness of attention modules to adversarial examples. We illustrate the performance of GAT on the large-scale YouTube-8M dataset on the task of video categorization. We further present ablation studies along with quantitative and qualitative analysis to showcase the improvement.
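To make the idea of gating frames by local and global context concrete, the following is a minimal sketch in plain Python. It is not the GAT formulation, which the abstract does not specify. The function names (`gate_frames`, `mean_vector`), the `window` parameter, and the scoring rule (a sigmoid over dot-product similarities) are all assumptions chosen for illustration.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gate_frames(frames, window=1):
    """Scale each frame feature vector by a scalar gate in (0, 1).

    Hypothetical scoring rule (an assumption, not taken from the paper):
    the gate is the sigmoid of the frame's similarity to its local context
    (mean of frames within `window` positions) plus its similarity to the
    global context (mean over all frames).
    """
    global_ctx = mean_vector(frames)
    gates, gated = [], []
    for t, frame in enumerate(frames):
        lo, hi = max(0, t - window), min(len(frames), t + window + 1)
        local_ctx = mean_vector(frames[lo:hi])
        g = sigmoid(dot(frame, local_ctx) + dot(frame, global_ctx))
        gates.append(g)
        gated.append([g * x for x in frame])
    return gates, gated

# Toy input: three 2-dimensional frame features; the last one points
# away from the other two.
frames = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
gates, gated = gate_frames(frames)
```

In this toy input, the third frame disagrees with both its neighbourhood and the video-level mean, so it receives the smallest gate. A learned model would replace the fixed dot-product scores with trainable projections.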