Existing temporal action detection (TAD) methods rely on generating an overwhelmingly large number of proposals per video. This leads to complex model designs, due to proposal generation and/or per-proposal action instance evaluation, and the resulting high computational cost. In this work, for the first time, we propose a proposal-free Temporal Action detection model with Global Segmentation mask (TAGS). Our core idea is to learn a global segmentation mask of each action instance jointly at the full video length. The TAGS model differs significantly from conventional proposal-based methods by focusing on global temporal representation learning to directly detect local start and end points of action instances without proposals. Further, by modeling TAD holistically rather than locally at the individual proposal level, TAGS needs a much simpler model architecture with lower computational cost. Extensive experiments show that despite its simpler design, TAGS outperforms existing TAD methods, achieving new state-of-the-art performance on two benchmarks. Importantly, it is ~20x faster to train and ~1.6x more efficient for inference. Our PyTorch implementation of TAGS is available at https://github.com/sauradip/TAGS.
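To make the core idea concrete, below is a minimal, hypothetical sketch (not the authors' released code) of the proposal-free scheme the abstract describes: predict one global segmentation mask per action class over the full video length, then read action instances directly off contiguous foreground runs of that mask instead of scoring proposals. All names here (`TinyTAGSHead`, `snippet_feats`, `masks_to_instances`, the threshold value) are illustrative assumptions, and the single 1x1 convolution stands in for whatever temporal encoder and classification head the actual model uses.

```python
# Hypothetical illustration of a global-segmentation-mask TAD head; not the
# official TAGS implementation. Names and hyperparameters are assumptions.
import torch
import torch.nn as nn


class TinyTAGSHead(nn.Module):
    """Predicts a per-class foreground mask over all T snippets of a video."""

    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        # 1x1 temporal convolution as a stand-in for the real classification head.
        self.classifier = nn.Conv1d(feat_dim, num_classes, kernel_size=1)

    def forward(self, snippet_feats: torch.Tensor) -> torch.Tensor:
        # snippet_feats: (B, feat_dim, T) -> masks: (B, num_classes, T) in [0, 1]
        return self.classifier(snippet_feats).sigmoid()


def masks_to_instances(masks: torch.Tensor, thresh: float = 0.5):
    """Decode one video's (num_classes, T) mask into (class, start, end, score)
    tuples by grouping contiguous above-threshold snippets -- no proposals."""
    instances = []
    for cls, row in enumerate(masks):
        active = (row >= thresh).tolist()
        start = None
        for t, on in enumerate(active + [False]):  # sentinel closes a final run
            if on and start is None:
                start = t
            elif not on and start is not None:
                score = row[start:t].mean().item()  # score = mean mask confidence
                instances.append((cls, start, t, score))
                start = None
    return instances


if __name__ == "__main__":
    torch.manual_seed(0)
    head = TinyTAGSHead(feat_dim=256, num_classes=20)
    feats = torch.randn(1, 256, 100)        # one video, 100 snippet features
    masks = head(feats)[0]                  # (20, 100) global masks
    print(masks_to_instances(masks)[:5])    # [(class, start, end, score), ...]
```

Because detection reduces to thresholding and run-length grouping over a single full-length mask, there is no per-proposal evaluation stage, which is where the simplicity and efficiency claims in the abstract come from.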