Temporal action localization is an important yet challenging task in video understanding. Typically, such a task aims at inferring both the action category and the start and end frames of each action instance in a long, untrimmed video. While most current models achieve good results by using pre-defined anchors and numerous actionness scores, such methods are burdened with both a large number of outputs and heavy tuning of the locations and sizes corresponding to different anchors. Anchor-free methods, in contrast, are lighter and free of these redundant hyper-parameters, but have received little attention. In this paper, we propose the first purely anchor-free temporal localization method, which is both efficient and effective. Our model includes (i) an end-to-end trainable basic predictor, (ii) a saliency-based refinement module that gathers more valuable boundary features for each proposal via a novel boundary pooling, and (iii) several consistency constraints that ensure our model can find the accurate boundary given arbitrary proposals. Extensive experiments show that our method outperforms all anchor-based and actionness-guided methods by a remarkable margin on THUMOS14, achieving state-of-the-art results, and achieves comparable results on ActivityNet v1.3. Code is available at https://github.com/TencentYoutuResearch/ActionDetection-AFSD.
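The abstract does not spell out how the boundary pooling works; the repository holds the actual implementation. As a rough illustration only, the minimal sketch below max-pools per-frame features within a fixed window around each predicted boundary, so the refinement module sees the most salient activation near a start or end point rather than the feature at a single frame. The window size, the choice of max-pooling, and the function name `boundary_pooling` are all assumptions for illustration, not the paper's exact design.

```python
import torch

def boundary_pooling(features: torch.Tensor,
                     boundaries: torch.Tensor,
                     window: int = 5) -> torch.Tensor:
    """Hypothetical sketch of boundary pooling (not the paper's exact op).

    features:   (T, C) per-frame feature map of one video.
    boundaries: (N,) predicted start or end frame indices, one per proposal.
    window:     half-width of the temporal window around each boundary.
    Returns:    (N, C) pooled boundary features, one per proposal.
    """
    T, _ = features.shape
    pooled = []
    for b in boundaries.tolist():
        # Clamp the window so it stays inside the video.
        lo = max(0, int(b) - window)
        hi = min(T, int(b) + window + 1)
        # Keep the most salient activation per channel near the boundary.
        pooled.append(features[lo:hi].max(dim=0).values)
    return torch.stack(pooled)

# Usage: pool features around the start boundaries of three proposals.
feats = torch.randn(128, 256)            # 128 frames, 256-dim features
starts = torch.tensor([10, 47, 99])      # predicted start frames
start_feats = boundary_pooling(feats, starts)  # -> (3, 256)
```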