Spatio-temporal, channel-wise, and motion patterns are three complementary and crucial types of information for video action recognition. Conventional 2D CNNs are computationally cheap but cannot capture temporal relationships; 3D CNNs can achieve good performance but are computationally intensive. In this work, we tackle this dilemma by designing a generic and effective module that can be embedded into 2D CNNs. To this end, we propose a spAtio-temporal, Channel and moTion excitatION (ACTION) module consisting of three paths: a Spatio-Temporal Excitation (STE) path, a Channel Excitation (CE) path, and a Motion Excitation (ME) path. The STE path employs a one-channel 3D convolution to characterize the spatio-temporal representation. The CE path adaptively recalibrates channel-wise feature responses by explicitly modeling interdependencies between channels along the temporal dimension. The ME path calculates feature-level temporal differences, which are then used to excite motion-sensitive channels. We equip 2D CNNs with the proposed ACTION module to form a simple yet effective ACTION-Net with very limited extra computational cost. ACTION-Net consistently outperforms its 2D CNN counterparts on three backbones (i.e., ResNet-50, MobileNet V2, and BNInception) across three datasets (i.e., Something-Something V2, Jester, and EgoGesture). Code is available at \url{https://github.com/V-Sense/ACTION-Net}.
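To make the ME path's idea concrete, the following is a minimal NumPy sketch of feature-level temporal differencing used as a channel gate. It is an illustration under stated assumptions, not the released implementation: the function name `motion_excitation`, the zero-padding of the last time step, the sigmoid gate, and the residual form `x + x * mask` are all assumptions made here, and the learned convolutions that a trainable module would contain are omitted.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def motion_excitation(x):
    """Gate channels by how much they change between adjacent frames.

    x: feature maps for one clip, shape (T, C, H, W).
    Returns an array of the same shape.
    NOTE: hypothetical, parameter-free simplification for illustration.
    """
    # Feature-level temporal difference between adjacent frames.
    diff = x[1:] - x[:-1]                                   # (T-1, C, H, W)
    # Assumption: pad the last time step with zeros so T is preserved.
    diff = np.concatenate([diff, np.zeros_like(x[:1])], axis=0)
    # Spatial average pooling: one motion score per (frame, channel).
    score = diff.mean(axis=(2, 3), keepdims=True)           # (T, C, 1, 1)
    # Squash scores into (0, 1) to obtain a channel-wise mask.
    mask = sigmoid(score)
    # Assumption: residual excitation, so the original features are kept
    # and motion-sensitive channels are amplified on top of them.
    return x + x * mask


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip = rng.standard_normal((8, 16, 7, 7))  # 8 frames, 16 channels
    print(motion_excitation(clip).shape)
```

In this sketch a channel whose activations do not change over time receives a neutral gate of 0.5, while a channel whose activations increase between frames receives a gate above 0.5 and is therefore amplified more strongly.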