Temporal action localization aims to predict the boundary and category of each action instance in untrimmed long videos. Most previous methods based on anchors or proposals neglect the global-local context interaction across entire video sequences. Moreover, their multi-stage designs cannot generate action boundaries and categories directly. To address these issues, this paper proposes a novel end-to-end model, called the adaptive perception transformer (AdaPerFormer for short). Specifically, AdaPerFormer explores a dual-branch multi-head self-attention mechanism. One branch handles global perception attention, which models entire video sequences and aggregates globally relevant contexts, while the other branch performs a local convolutional shift that aggregates intra-frame and inter-frame information through our bidirectional shift operation. The end-to-end design produces the boundaries and categories of video actions directly, without extra steps. Extensive experiments and ablation studies demonstrate the effectiveness of our design. Our method achieves state-of-the-art accuracy on the THUMOS14 dataset (65.8\% mAP@0.5, 42.6\% mAP@0.7, and 62.7\% average mAP) and obtains competitive performance on the ActivityNet-1.3 dataset with an average mAP of 36.1\%. The code and models are available at https://github.com/SouperO/AdaPerFormer.
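To make the dual-branch idea concrete, the sketch below pairs a standard multi-head self-attention branch (global perception) with a local branch built from a channel-wise bidirectional temporal shift followed by a depth-wise convolution. This is a minimal illustration under our own assumptions; the names (DualBranchBlock, bidirectional_shift, shift_ratio) and all implementation details are hypothetical and are not taken from the released code.

```python
# Illustrative sketch only: a dual-branch block combining global self-attention
# with a local bidirectional temporal shift. Names and details are assumptions,
# not the authors' implementation.
import torch
import torch.nn as nn


def bidirectional_shift(x: torch.Tensor, shift_ratio: float = 0.25) -> torch.Tensor:
    """Shift a fraction of channels one step forward and one step backward in
    time so each frame mixes information with its neighbours.
    x: (batch, time, channels)
    """
    b, t, c = x.shape
    n = int(c * shift_ratio)
    out = torch.zeros_like(x)
    out[:, 1:, :n] = x[:, :-1, :n]            # shift forward in time
    out[:, :-1, n:2 * n] = x[:, 1:, n:2 * n]  # shift backward in time
    out[:, :, 2 * n:] = x[:, :, 2 * n:]       # leave remaining channels unshifted
    return out


class DualBranchBlock(nn.Module):
    """Global attention branch + local convolutional shift branch (sketch)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # Global branch: multi-head self-attention over the whole sequence.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Local branch: depth-wise temporal convolution applied after the
        # bidirectional shift to aggregate intra- and inter-frame features.
        self.local_conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim)
        h = self.norm(x)
        global_out, _ = self.attn(h, h, h)                # global perception branch
        local_out = self.local_conv(
            bidirectional_shift(h).transpose(1, 2)
        ).transpose(1, 2)                                 # local shift branch
        return x + global_out + local_out                 # residual fusion


if __name__ == "__main__":
    feats = torch.randn(2, 128, 256)  # 2 videos, 128 snippets, 256-d features
    block = DualBranchBlock(dim=256)
    print(block(feats).shape)  # torch.Size([2, 128, 256])
```

The fusion here is a simple residual sum of the two branches; the actual model may combine them differently, but the sketch shows how global sequence-level attention and local shift-based aggregation can coexist in one block.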