Temporal action localization aims to predict the boundary and category of each action instance in untrimmed long videos. Most previous anchor- or proposal-based methods neglect the global-local context interaction across the entire video sequence, and their multi-stage designs cannot generate action boundaries and categories directly. To address these issues, this paper proposes an end-to-end model called the Adaptive Perception transformer (AdaPerFormer for short). Specifically, AdaPerFormer employs a dual-branch attention mechanism. One branch performs global perception attention, which models the entire video sequence and aggregates globally relevant context, while the other branch performs a local convolutional shift that aggregates intra-frame and inter-frame information through our bidirectional shift operation. The end-to-end design produces the boundaries and categories of video actions without extra post-processing steps. Extensive experiments and ablation studies demonstrate the effectiveness of our design. Our method achieves competitive performance on the THUMOS14 and ActivityNet-1.3 datasets.
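To make the dual-branch idea concrete, the following is a minimal sketch (in PyTorch) of one block that combines a global self-attention branch with a local branch built on a bidirectional temporal shift followed by a depth-wise convolution, assuming frame-level features of shape (batch, time, channels). Module and parameter names such as DualBranchBlock and shift_ratio are hypothetical illustrations, not the authors' released implementation.

```python
# A minimal sketch of the dual-branch block described above; names and details
# are assumptions for illustration, not the paper's official code.
import torch
import torch.nn as nn


class DualBranchBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, shift_ratio: float = 0.25):
        super().__init__()
        # Global branch: multi-head self-attention over the whole sequence.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Local branch: depth-wise temporal convolution applied after shifting.
        self.local_conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.shift_ratio = shift_ratio

    def bidirectional_shift(self, x: torch.Tensor) -> torch.Tensor:
        # Shift one fraction of the channels forward and another fraction
        # backward in time so each frame mixes with both of its neighbors.
        b, t, c = x.shape
        n = int(c * self.shift_ratio)
        out = x.clone()
        out[:, 1:, :n] = x[:, :-1, :n]            # shift forward in time
        out[:, :-1, n:2 * n] = x[:, 1:, n:2 * n]  # shift backward in time
        return out

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        # Global perception branch: aggregate context across the entire video.
        g, _ = self.attn(h, h, h)
        # Local branch: bidirectional shift, then temporal convolution.
        l = self.local_conv(self.bidirectional_shift(h).transpose(1, 2)).transpose(1, 2)
        return x + g + l


# Usage: a batch of 2 videos, each with 256 frame features of dimension 512.
feats = torch.randn(2, 256, 512)
print(DualBranchBlock(512)(feats).shape)  # torch.Size([2, 256, 512])
```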