Action visual tempo characterizes the dynamics and temporal scale of an action, and it helps distinguish human actions that are highly similar in visual dynamics and appearance. Previous methods capture visual tempo either by sampling raw videos at multiple rates, which requires a costly multi-layer network to handle each rate, or by hierarchically sampling backbone features, which relies heavily on high-level features that miss fine-grained temporal dynamics. In this work, we propose a Temporal Correlation Module (TCM) that can be easily embedded into current action recognition backbones in a plug-and-play manner and that extracts action visual tempo from low-level backbone features at a single layer. Specifically, TCM contains two main components: a Multi-scale Temporal Dynamics Module (MTDM) and a Temporal Attention Module (TAM). MTDM applies a correlation operation to learn pixel-wise, fine-grained temporal dynamics for both fast and slow tempos. TAM adaptively emphasizes expressive features and suppresses inessential ones by analyzing global information across the different tempos. Extensive experiments on several action recognition benchmarks, namely Something-Something V1 and V2, Kinetics-400, UCF-101, and HMDB-51, demonstrate that the proposed TCM improves the performance of existing video-based action recognition models by a large margin. The source code is publicly released at https://github.com/yzfly/TCM.
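To make the two components more concrete, the following is a minimal NumPy sketch of the general idea, not the released implementation. It assumes that a tempo can be approximated by the frame stride used when correlating features, that correlation is a channel-averaged dot product over a small spatial neighbourhood, and that attention is a parameter-free softmax over globally pooled descriptors. The function names `local_correlation` and `tempo_attention`, the `stride` and `radius` parameters, and the choice of strides are all hypothetical; a trained TAM would use learned layers rather than a plain softmax.

```python
import numpy as np


def local_correlation(feats, stride, radius=1):
    """Correlate each frame with the frame `stride` steps later.

    feats: array of shape (T, C, H, W) holding low-level backbone features.
    Returns an array of shape (T - stride, (2*radius+1)**2, H, W), where each
    channel is the channel-averaged dot product for one spatial displacement.
    """
    if stride < 1 or stride >= feats.shape[0]:
        raise ValueError("stride must satisfy 1 <= stride < T")
    _, _, H, W = feats.shape
    src = feats[:-stride]  # frames t
    dst = feats[stride:]   # frames t + stride
    # Zero-pad spatially so every shifted window stays in bounds.
    dst = np.pad(dst, ((0, 0), (0, 0), (radius, radius), (radius, radius)))
    maps = []
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            shifted = dst[:, :, dy:dy + H, dx:dx + W]
            maps.append((src * shifted).mean(axis=1))  # (T - stride, H, W)
    return np.stack(maps, axis=1)


def tempo_attention(tempo_maps):
    """Fuse per-tempo correlation maps with softmax weights (assumed form).

    tempo_maps: list of K arrays, each of shape (T_k, D, H, W).
    Returns (fused, weights) with shapes (D, H, W) and (K, D).
    """
    # Average over time so tempos with different lengths become comparable.
    pooled = np.stack([m.mean(axis=0) for m in tempo_maps])  # (K, D, H, W)
    # Global spatial pooling gives one descriptor per tempo and channel.
    desc = pooled.mean(axis=(2, 3))                          # (K, D)
    # Numerically stable softmax across the tempo axis.
    e = np.exp(desc - desc.max(axis=0, keepdims=True))
    weights = e / e.sum(axis=0, keepdims=True)
    fused = (weights[:, :, None, None] * pooled).sum(axis=0)
    return fused, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((8, 4, 5, 5))   # T=8, C=4, H=W=5
    fast = local_correlation(feats, stride=1)   # adjacent frames
    slow = local_correlation(feats, stride=4)   # distant frames
    fused, weights = tempo_attention([fast, slow])
    print(fast.shape, slow.shape, fused.shape, weights.shape)
```

In this sketch, a stride of 1 compares adjacent frames and a stride of 4 compares distant ones; which stride corresponds to "fast" or "slow" tempo in the paper's terminology is not specified in the abstract. The softmax weights sum to one across tempos for each displacement channel.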