Temporal action localization, which aims to localize and classify actions in untrimmed videos, plays an important role in video analysis. Previous methods often predict actions on a feature space of a single temporal scale. However, temporal features at a low-level scale lack sufficient semantics for action classification, while a high-level scale cannot provide rich details of the action boundaries. To address this issue, we propose to predict actions on a feature space of multiple temporal scales. Specifically, we use refined feature pyramids of different scales to pass semantics from high-level scales to low-level scales. In addition, to model the long temporal range of the entire video, we use a spatial-temporal transformer encoder to capture long-range dependencies among video frames. The refined features with long-range dependencies are then fed into a classifier for coarse action prediction. Finally, to further improve prediction accuracy, we propose a frame-level self-attention module to refine the classification and boundaries of each action instance. Extensive experiments show that the proposed method outperforms state-of-the-art approaches on the THUMOS14 dataset and achieves comparable performance on the ActivityNet1.3 dataset. Compared with A2Net (TIP20, Avg\{0.3:0.7\}), Sub-Action (CSVT2022, Avg\{0.1:0.5\}), and AFSD (CVPR21, Avg\{0.3:0.7\}) on the THUMOS14 dataset, the proposed method achieves improvements of 12.6\%, 17.4\%, and 2.2\%, respectively.
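The following is a minimal, hypothetical PyTorch sketch of the three components described above: a top-down refined temporal feature pyramid, a transformer encoder for long-range dependencies (a temporal-only stand-in for the spatial-temporal encoder), and a frame-level self-attention refinement head. All module names, channel sizes, and the class count are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only, not the authors' released code. Assumed names:
# RefinedTemporalPyramid, CoarsePredictor, FrameLevelRefiner.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RefinedTemporalPyramid(nn.Module):
    """Passes semantics from high-level (coarse) scales down to
    low-level (fine) scales, FPN-style, along the temporal axis."""

    def __init__(self, channels=256, num_levels=4):
        super().__init__()
        self.laterals = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=1) for _ in range(num_levels)
        )
        self.smooth = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1)
            for _ in range(num_levels)
        )

    def forward(self, feats):  # feats: list of (B, C, T_i), fine -> coarse
        outs = [lat(f) for lat, f in zip(self.laterals, feats)]
        # Top-down pass: upsample the coarser level and fuse into the finer one.
        for i in range(len(outs) - 2, -1, -1):
            up = F.interpolate(outs[i + 1], size=outs[i].shape[-1], mode="linear")
            outs[i] = self.smooth[i](outs[i] + up)
        return outs


class CoarsePredictor(nn.Module):
    """Transformer encoder over frames (temporal-only simplification of the
    spatial-temporal encoder), then per-frame coarse action scores."""

    def __init__(self, channels=256, num_classes=20, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.cls_head = nn.Linear(channels, num_classes + 1)  # +1 background

    def forward(self, x):  # x: (B, C, T)
        h = self.encoder(x.transpose(1, 2))  # (B, T, C), long-range dependencies
        return h, self.cls_head(h)           # frame features and coarse scores


class FrameLevelRefiner(nn.Module):
    """Frame-level self-attention to refine the coarse per-frame predictions."""

    def __init__(self, channels=256, num_classes=20):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads=8, batch_first=True)
        self.refine_head = nn.Linear(channels, num_classes + 1)

    def forward(self, h):  # h: (B, T, C)
        a, _ = self.attn(h, h, h)            # every frame attends to all frames
        return self.refine_head(h + a)       # refined per-frame scores


if __name__ == "__main__":
    # Fine-to-coarse temporal feature maps for a batch of 2 videos.
    feats = [torch.randn(2, 256, t) for t in (256, 128, 64, 32)]
    pyramid = RefinedTemporalPyramid()(feats)
    h, coarse = CoarsePredictor()(pyramid[0])  # predict on the finest level
    refined = FrameLevelRefiner()(h)           # refine per-frame scores
    print(coarse.shape, refined.shape)         # (2, 256, 21) (2, 256, 21)
```

In this sketch the prediction is run on the finest pyramid level for brevity; a multi-scale variant would apply the predictor to every refined level and fuse the results.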