Self-attention based Transformer models have demonstrated impressive results for image classification and object detection, and more recently for video understanding. Inspired by this success, we investigate the application of Transformer networks for temporal action localization in videos. To this end, we present ActionFormer -- a simple yet powerful model to identify actions in time and recognize their categories in a single shot, without using action proposals or relying on pre-defined anchor windows. ActionFormer combines a multiscale feature representation with local self-attention, and uses a lightweight decoder to classify every moment in time and estimate the corresponding action boundaries. We show that this orchestrated design results in major improvements over prior work. Without bells and whistles, ActionFormer achieves 71.0% mAP at tIoU=0.5 on THUMOS14, outperforming the best prior model by 14.1 absolute percentage points. Further, ActionFormer demonstrates strong results on ActivityNet 1.3 (36.6% average mAP) and EPIC-Kitchens 100 (+13.5% average mAP over prior work). Our code is available at http://github.com/happyharrycn/actionformer_release.
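To make the anchor-free, single-shot design concrete, the sketch below is a minimal PyTorch illustration of the idea described above: a multiscale temporal feature pyramid built by downsampling Transformer blocks, followed by a lightweight decoder that classifies every moment in time and regresses its distance to the action boundaries. This is not the authors' implementation; the class and parameter names (e.g. `ActionFormerSketch`, `PyramidBlock`, `num_classes`) are hypothetical, and standard full self-attention stands in for the paper's local self-attention for brevity.

```python
# Minimal, illustrative sketch of an anchor-free temporal action localization
# model: a multiscale pyramid of Transformer blocks plus a lightweight decoder
# that outputs per-moment class scores and (start, end) boundary offsets.
# NOTE: names and hyperparameters are assumptions, not the official code.
import torch
import torch.nn as nn


class PyramidBlock(nn.Module):
    """Transformer encoder block followed by 2x temporal downsampling."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # Full self-attention here; the paper uses *local* self-attention.
        self.encoder = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.downsample = nn.MaxPool1d(kernel_size=2, stride=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.encoder(x)  # (B, T, C)
        # Pool along time to build the next (coarser) pyramid level.
        return self.downsample(x.transpose(1, 2)).transpose(1, 2)  # (B, T//2, C)


class ActionFormerSketch(nn.Module):
    def __init__(self, in_dim: int, dim: int, num_classes: int, levels: int = 3):
        super().__init__()
        self.proj = nn.Linear(in_dim, dim)
        self.pyramid = nn.ModuleList([PyramidBlock(dim) for _ in range(levels)])
        # Lightweight decoder shared across levels: per-moment action scores
        # and non-negative (distance-to-start, distance-to-end) offsets.
        self.cls_head = nn.Conv1d(dim, num_classes, kernel_size=3, padding=1)
        self.reg_head = nn.Conv1d(dim, 2, kernel_size=3, padding=1)

    def forward(self, feats: torch.Tensor):
        x = self.proj(feats)  # (B, T, C) projected clip-level features
        cls_outs, reg_outs = [], []
        for block in self.pyramid:
            x = block(x)
            z = x.transpose(1, 2)  # (B, C, T_l) for the 1D conv heads
            cls_outs.append(self.cls_head(z))          # class scores per moment
            reg_outs.append(self.reg_head(z).relu())   # boundary offsets >= 0
        return cls_outs, reg_outs


if __name__ == "__main__":
    # Toy usage: pretrained clip features (e.g. 2304-d) over 128 time steps.
    model = ActionFormerSketch(in_dim=2304, dim=256, num_classes=20)
    clip_feats = torch.randn(2, 128, 2304)
    cls_outs, reg_outs = model(clip_feats)
    print([c.shape for c in cls_outs])  # one output per pyramid level
```

Every time step at every pyramid level acts as a candidate moment, so no action proposals or pre-defined anchor windows are needed; at inference, high-scoring moments and their regressed offsets would be decoded into action segments and merged with non-maximum suppression.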