Effective extraction of temporal patterns is crucial for recognizing temporally varying actions in video. We argue that the fixed-size spatio-temporal convolution kernels used in convolutional neural networks (CNNs) can be improved to extract informative motions executed at different time scales. To address this challenge, we present a novel spatio-temporal convolution block capable of extracting spatio-temporal patterns at multiple temporal resolutions. Our proposed multi-temporal convolution (MTConv) blocks utilize two branches that focus on brief and prolonged spatio-temporal patterns, respectively. A third branch aligns the extracted time-varying features with global motion patterns through recurrent cells. The proposed blocks are lightweight and can be integrated into any 3D-CNN architecture, yielding a substantial reduction in computational costs. Extensive experiments on the Kinetics, Moments in Time, and HACS action recognition benchmarks demonstrate that MTConvs achieve performance competitive with the state of the art at a significantly lower computational footprint.