Temporal action localization aims to localize action instances in untrimmed videos. Existing works have designed various effective modules to precisely localize action instances based on appearance and motion features. However, by treating these two kinds of features with equal importance, previous works cannot take full advantage of each modality, leaving the learned model sub-optimal. To tackle this issue, we make an early effort to study temporal action localization from the perspective of multi-modality feature learning, based on the observation that different actions exhibit specific preferences for the appearance or motion modality. Specifically, we build a novel structured attention composition module. Unlike conventional attention, the proposed module does not infer frame attention and modality attention independently. Instead, by casting the relationship between the modality attention and the frame attention as an attention assignment process, the structured attention composition module learns to encode the frame-modality structure and uses it, based on optimal transport theory, to regularize the inferred frame attention and modality attention. The final frame-modality attention is obtained by composing the two individual attentions. The proposed module can be deployed as a plug-and-play component in existing action localization frameworks. Extensive experiments on two widely used benchmarks show that the proposed structured attention composition consistently improves four state-of-the-art temporal action localization methods and sets a new state of the art on THUMOS14. Code is available at https://github.com/VividLe/Online-Action-Detection.
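The attention assignment idea can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a hypothetical frame-modality affinity matrix `scores`, uses standard Sinkhorn iterations as one common way to solve an entropy-regularized optimal transport problem, and takes the composition to be a simple outer product. All function names, values, and the `eps` setting are illustrative assumptions.

```python
import math


def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def sinkhorn_plan(scores, row_marginal, col_marginal, eps=0.5, n_iters=200):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    scores[t][m] is an assumed affinity between frame t and modality m;
    the transport cost is taken to be the negative affinity.
    """
    T, M = len(scores), len(scores[0])
    # Gibbs kernel built from the affinities.
    K = [[math.exp(scores[t][m] / eps) for m in range(M)] for t in range(T)]
    u = [1.0] * T
    v = [1.0] * M
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the target marginals.
        u = [row_marginal[t] / sum(K[t][m] * v[m] for m in range(M))
             for t in range(T)]
        v = [col_marginal[m] / sum(K[t][m] * u[t] for t in range(T))
             for m in range(M)]
    return [[u[t] * K[t][m] * v[m] for m in range(M)] for t in range(T)]


def compose(frame_attn, modality_attn):
    """Frame-modality attention as the outer product of the two attentions."""
    return [[f * m for m in modality_attn] for f in frame_attn]


# Illustrative values only: 4 frames, 2 modalities (appearance, motion).
frame_attn = [0.9, 0.8, 0.1, 0.2]
modality_attn = [0.3, 0.7]
scores = [[0.2, 0.9], [0.3, 0.8], [0.6, 0.1], [0.5, 0.4]]

# Assignment plan whose marginals agree with both attentions.
plan = sinkhorn_plan(scores, normalize(frame_attn), normalize(modality_attn))
# Composed frame-modality attention.
composed = compose(frame_attn, modality_attn)
```

In this sketch, the row sums of `plan` approximate the normalized frame attention and its column sums match the normalized modality attention, so the plan can be read as one way to distribute attention mass over frame-modality pairs. How the learned structure actually regularizes the two attentions during training is specified in the paper and the released code, not here.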