Action detection aims at inferring both the action category and the temporal boundaries (start and end moments) of each action instance in a long, untrimmed video. While vision Transformers have driven the recent advances in video understanding, it is non-trivial to design an efficient architecture for action detection due to the prohibitively expensive self-attention over long sequences of video clips. To this end, we present an efficient hierarchical Spatio-Temporal Pyramid Transformer (STPT) for action detection, building upon the observation that the early self-attention layers in Transformers still focus on local patterns. Specifically, we propose to use local window attention to encode rich local spatio-temporal representations in the early stages, while applying global attention modules to capture long-term space-time dependencies in the later stages. In this way, our STPT can encode both locality and dependency with largely reduced redundancy, delivering a promising trade-off between accuracy and efficiency. For example, with only RGB input, the proposed STPT achieves 53.6% mAP on THUMOS14, surpassing the I3D+AFSD RGB model by over 10% and performing favorably against the state-of-the-art AFSD, which uses additional flow features, while requiring 31% fewer GFLOPs. STPT thus serves as an effective and efficient end-to-end Transformer-based framework for action detection.
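To make the stage design concrete, below is a minimal sketch (not the authors' released code) of the two attention modes the abstract describes: self-attention restricted to local spatio-temporal windows for early stages, and full global attention for later stages. The window size, tensor layout, and module names are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch of local-window vs. global spatio-temporal attention.
# Shapes, window sizes, and class names are illustrative assumptions.
import torch
import torch.nn as nn


class LocalWindowAttention(nn.Module):
    """Self-attention restricted to non-overlapping (t, h, w) windows,
    as used in the early stages to encode local spatio-temporal patterns."""

    def __init__(self, dim, num_heads, window=(2, 7, 7)):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, T, H, W, C); T, H, W divisible by window
        B, T, H, W, C = x.shape
        wt, wh, ww = self.window
        # Partition tokens into windows: (B * num_windows, wt*wh*ww, C).
        x = x.view(B, T // wt, wt, H // wh, wh, W // ww, ww, C)
        x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, wt * wh * ww, C)
        # Attention is computed only among tokens within the same window.
        x, _ = self.attn(x, x, x)
        # Reverse the window partition back to (B, T, H, W, C).
        x = x.view(B, T // wt, H // wh, W // ww, wt, wh, ww, C)
        x = x.permute(0, 1, 4, 2, 5, 3, 6, 7).reshape(B, T, H, W, C)
        return x


class GlobalAttention(nn.Module):
    """Full self-attention over all spatio-temporal tokens, as used in
    the later stages to capture long-term space-time dependencies."""

    def __init__(self, dim, num_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, T, H, W, C)
        B, T, H, W, C = x.shape
        tokens = x.reshape(B, T * H * W, C)
        out, _ = self.attn(tokens, tokens, tokens)
        return out.reshape(B, T, H, W, C)


# Usage: early stages run windowed attention (cost linear in the number of
# windows), later stages run global attention on a downsampled token grid.
feat = torch.randn(2, 8, 14, 14, 96)
local_out = LocalWindowAttention(dim=96, num_heads=4)(feat)
global_out = GlobalAttention(dim=96, num_heads=4)(feat)
```

The efficiency argument follows directly: windowed attention keeps the quadratic cost confined to small fixed-size windows, so the expensive global attention is only applied after the pyramid has reduced the token count.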