Transformer networks are effective at modeling long-range contextual information and have recently demonstrated exemplary performance in the natural language processing domain. Conventionally, the temporal action proposal generation (TAPG) task is divided into two main sub-tasks: boundary prediction and proposal confidence prediction, which rely on frame-level dependencies and proposal-level relationships, respectively. To capture dependencies at these different levels of granularity, this paper presents a unified temporal action proposal generation framework built on the original Transformer architecture, called TAPG Transformer, which consists of a Boundary Transformer and a Proposal Transformer. Specifically, the Boundary Transformer captures long-term temporal dependencies to predict precise boundary information, while the Proposal Transformer learns rich inter-proposal relationships for reliable confidence evaluation. Extensive experiments are conducted on two popular benchmarks, ActivityNet-1.3 and THUMOS14, and the results demonstrate that TAPG Transformer outperforms state-of-the-art methods. Equipped with an existing action classifier, our method also achieves remarkable performance on the temporal action localization task. Code and models will be made available.
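To make the two-branch design concrete, the following is a minimal, illustrative PyTorch sketch of the framework described above: one Transformer encoder over frame-level features for boundary prediction and another over proposal-level features for confidence evaluation. The class name, feature dimensions, head counts, and output heads are placeholder assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class TAPGTransformerSketch(nn.Module):
    """Illustrative two-branch sketch (hypothetical hyperparameters):
    a Boundary Transformer over frame-level features and a
    Proposal Transformer over proposal-level features."""

    def __init__(self, feat_dim=256, num_heads=8, num_layers=3):
        super().__init__()
        # Boundary branch: models long-range frame-level dependencies.
        boundary_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.boundary_transformer = nn.TransformerEncoder(boundary_layer, num_layers)
        self.boundary_head = nn.Linear(feat_dim, 2)   # start / end probabilities

        # Proposal branch: models relationships among candidate proposals.
        proposal_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.proposal_transformer = nn.TransformerEncoder(proposal_layer, num_layers)
        self.confidence_head = nn.Linear(feat_dim, 1)  # proposal confidence score

    def forward(self, frame_feats, proposal_feats):
        # frame_feats:    (batch, T, feat_dim) snippet-level video features
        # proposal_feats: (batch, N, feat_dim) pooled features of N candidate proposals
        boundary = torch.sigmoid(
            self.boundary_head(self.boundary_transformer(frame_feats)))
        confidence = torch.sigmoid(
            self.confidence_head(self.proposal_transformer(proposal_feats)))
        return boundary, confidence
```

In this sketch, the boundary branch outputs per-frame start/end probabilities and the proposal branch outputs a confidence score per candidate proposal; how proposals are generated from the boundaries and how the two outputs are fused follows the paper and is omitted here.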