Self-attention based Transformer models have demonstrated impressive results for image classification and object detection, and more recently for video understanding. Inspired by this success, we investigate the application of Transformer networks for temporal action localization in videos. To this end, we present ActionFormer -- a simple yet powerful model to identify actions in time and recognize their categories in a single shot, without using action proposals or relying on pre-defined anchor windows. ActionFormer combines a multiscale feature representation with local self-attention, and uses a lightweight decoder to classify every moment in time and estimate the corresponding action boundaries. We show that this orchestrated design results in major improvements over prior works. Without bells and whistles, ActionFormer achieves 65.6% mAP at tIoU=0.5 on THUMOS14, outperforming the best prior model by 8.7 absolute percentage points and crossing 60% mAP for the first time. Further, ActionFormer demonstrates strong results on ActivityNet 1.3 (36.0% average mAP) and the more recent EPIC-Kitchens 100 (+13.5% average mAP over prior works). Our code is available at http://github.com/happyharrycn/actionformer_release
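To make the design concrete, below is a minimal PyTorch sketch of the kind of architecture the abstract describes: a multiscale (pyramid) temporal representation built with local windowed self-attention, followed by lightweight shared 1D-convolutional heads that classify every moment and regress its distances to the action start and end. This is an illustration under assumptions, not the released implementation (see the repository linked above); all module names, feature dimensions, window sizes, and the number of pyramid levels are placeholders.

```python
# Minimal sketch of an ActionFormer-style model, based only on the abstract.
# Hyperparameters (dim, window, levels, num_classes) are illustrative assumptions.
import torch
import torch.nn as nn


class LocalSelfAttentionBlock(nn.Module):
    """Transformer block whose self-attention is restricted to fixed-size local windows."""

    def __init__(self, dim: int, num_heads: int = 4, window: int = 16):
        super().__init__()
        self.window = window
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                                  # x: (B, T, C), T divisible by window
        B, T, C = x.shape
        w = self.window
        h = self.norm1(x).reshape(B * T // w, w, C)        # split the time axis into windows
        h, _ = self.attn(h, h, h)                          # attention only within each window
        x = x + h.reshape(B, T, C)                         # residual connection
        return x + self.mlp(self.norm2(x))


class ActionFormerSketch(nn.Module):
    """Multiscale local-attention encoder with shared 1D-conv heads for per-moment
    action classification and boundary (start/end offset) regression."""

    def __init__(self, in_dim: int, dim: int = 256, num_classes: int = 20, levels: int = 3):
        super().__init__()
        self.proj = nn.Conv1d(in_dim, dim, kernel_size=3, padding=1)
        self.blocks = nn.ModuleList(LocalSelfAttentionBlock(dim) for _ in range(levels))
        self.downsample = nn.MaxPool1d(kernel_size=2, stride=2)
        self.cls_head = nn.Conv1d(dim, num_classes, kernel_size=3, padding=1)
        self.reg_head = nn.Conv1d(dim, 2, kernel_size=3, padding=1)   # distances to start/end

    def forward(self, feats):                              # feats: (B, in_dim, T) clip features
        x = self.proj(feats)                               # (B, dim, T)
        cls_out, reg_out = [], []
        for block in self.blocks:
            x = block(x.transpose(1, 2)).transpose(1, 2)   # local attention over time
            cls_out.append(self.cls_head(x))               # per-moment action scores
            reg_out.append(self.reg_head(x).relu())        # per-moment boundary offsets
            x = self.downsample(x)                         # next (coarser) pyramid level
        return cls_out, reg_out


if __name__ == "__main__":
    model = ActionFormerSketch(in_dim=2304, num_classes=20)
    video_feats = torch.randn(2, 2304, 256)                # e.g. pre-extracted clip features
    cls_scores, boundaries = model(video_feats)
    print([c.shape for c in cls_scores])                   # one prediction map per pyramid level
```

Note how this mirrors the single-shot, anchor-free formulation stated in the abstract: every time step at every pyramid level directly predicts class scores and its offsets to the nearest action boundaries, so no proposals or pre-defined anchor windows are needed.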