Temporal Action Detection (TAD) is an essential and challenging topic in video understanding. It aims to localize the temporal segments that contain human action instances and to predict their action categories. Previous works rely heavily on dense candidates, obtained either by designing varied anchors or by enumerating all combinations of boundaries on video sequences; as a result, they involve complicated pipelines and sensitive hand-crafted designs. Recently, with the resurgence of the Transformer, query-based methods have emerged as a promising alternative because of their simplicity and flexibility. However, a performance gap remains between query-based methods and well-established methods. In this paper, we identify the main challenge as the large variation in action duration and the ambiguous boundaries of short action instances; meanwhile, the quadratic computational cost of global attention prevents query-based methods from building multi-scale feature maps. Towards high-quality temporal action detection, we introduce Sparse Proposals that interact with hierarchical features. In our method, named SP-TAD, each proposal attends to a local segment feature in the temporal feature pyramid. This local interaction makes it possible to use high-resolution features and thus preserve the details of action instances. Extensive experiments demonstrate the effectiveness of our method, especially under high tIoU thresholds. For example, we achieve state-of-the-art performance on THUMOS14 (45.7% mAP@0.6, 33.4% mAP@0.7 and 53.5% mAP@Avg) and competitive results on ActivityNet-1.3 (32.99% mAP@Avg). Code will be made available at https://github.com/wjn922/SP-TAD.
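To make the role of the tIoU threshold concrete, the following is a minimal sketch of the standard temporal IoU between a predicted and a ground-truth segment. It is not taken from the SP-TAD code; the function names `temporal_iou` and `is_match` and the example segments are illustrative assumptions.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two segments given as (start, end) in seconds."""
    p_start, p_end = pred
    g_start, g_end = gt
    # Length of the overlap, clamped at zero for disjoint segments.
    inter = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    # Union = sum of both lengths minus the overlap.
    union = (p_end - p_start) + (g_end - g_start) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, gt, threshold):
    """A prediction counts as correct at a given tIoU threshold (hypothetical helper)."""
    return temporal_iou(pred, gt) >= threshold


# Illustrative segments (assumed values, not from any dataset):
# the same 0.4 s boundary shift applied to a short and a long action.
short_gt, short_pred = (10.0, 12.0), (10.4, 12.4)
long_gt, long_pred = (10.0, 30.0), (10.4, 30.4)

print(temporal_iou(short_pred, short_gt))  # roughly 0.67
print(temporal_iou(long_pred, long_gt))    # roughly 0.96
```

In this example the short action fails the 0.7 threshold while the long one passes easily, which illustrates why imprecise boundaries on short instances are penalized most under high tIoU thresholds.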