In this paper, we present TriDet, a one-stage framework for temporal action detection. Existing methods often suffer from imprecise boundary predictions due to ambiguous action boundaries in videos. To alleviate this problem, we propose a novel Trident-head that models each action boundary via an estimated relative probability distribution around the boundary. In the feature pyramid of TriDet, we propose an efficient Scalable-Granularity Perception (SGP) layer to mitigate the rank-loss problem that self-attention exhibits on video features and to aggregate information across different temporal granularities. Benefiting from the Trident-head and the SGP-based feature pyramid, TriDet achieves state-of-the-art performance on three challenging benchmarks (THUMOS14, HACS, and EPIC-KITCHEN 100) with lower computational cost than previous methods. For example, TriDet achieves an average mAP of $69.3\%$ on THUMOS14, outperforming the previous best by $2.5\%$ while requiring only $74.6\%$ of its latency. The code is available at https://github.com/sssste/TriDet.
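The core idea of predicting a boundary "via an estimated relative probability distribution" can be illustrated with a minimal sketch: instead of regressing a single offset, the head scores a small window of candidate bins around the boundary and takes the expectation of the resulting softmax distribution as the predicted offset. This is only an illustrative simplification of the mechanism described above, not the paper's actual implementation; the function names and the bin layout here are assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of logits.
    e = np.exp(x - x.max())
    return e / e.sum()

def expected_boundary_offset(logits):
    """Illustrative distribution-style boundary estimate (assumed sketch,
    not TriDet's exact head): turn per-bin logits for B candidate offsets
    into a probability distribution, then return its expected value.

    logits: shape (B,), raw scores for candidate offsets 0 .. B-1.
    """
    probs = softmax(np.asarray(logits, dtype=np.float64))
    bins = np.arange(len(probs))
    # Expected offset under the estimated relative distribution.
    return float((probs * bins).sum())
```

With uniform logits over five bins, the distribution is flat and the expected offset is the window center (2.0); sharpening the logits around one bin moves the estimate toward that bin. The expectation makes the boundary prediction a smooth function of the scores, which is the usual motivation for distribution-based heads over hard argmax picks.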