Modeling ability, computational cost, and accuracy are the three most active research topics for spatio-temporal networks in video action recognition. Traditional 2D convolution has a low computational cost but cannot capture temporal relationships; convolutional neural network (CNN) models based on 3D convolution achieve good performance, but their computational cost is high and they have a large number of parameters. In this paper, we propose a plug-and-play Spatio-temporal Shift Module (STSM), a generic module that is both effective and efficient. Specifically, after STSM is inserted into another network, the performance of that network can be improved without increasing the computational cost or the number of parameters. In particular, when the host network is a 2D CNN, STSM enables it to learn efficient spatio-temporal features. We conducted extensive evaluations of the proposed module, performed numerous experiments to study its effectiveness in video action recognition, and achieved state-of-the-art results on the Kinetics-400 and Something-Something V2 datasets.
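To make the core idea concrete, the following is a minimal sketch of a parameter-free spatio-temporal shift applied to 2D CNN activations. The exact shift pattern, channel fractions, and directions used by STSM are not specified in this abstract; the choices below (one-step temporal and spatial shifts on a fraction of channels, zero-padded) are illustrative assumptions only.

```python
import torch


def spatio_temporal_shift(x: torch.Tensor, fold_div: int = 8) -> torch.Tensor:
    """Shift a fraction of channels along time and space (hypothetical sketch).

    Args:
        x: activations of shape (N, T, C, H, W).
        fold_div: 1/fold_div of the channels are shifted per direction (assumed value).
    Returns:
        Tensor of the same shape; vacated positions are zero-padded,
        so no parameters or multiply-adds are introduced.
    """
    n, t, c, h, w = x.size()
    fold = c // fold_div
    out = torch.zeros_like(x)

    # Temporal shifts: move some channels one frame backward/forward.
    out[:, :-1, :fold] = x[:, 1:, :fold]                    # toward earlier frames
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]    # toward later frames

    # Spatial shifts: move some channels one pixel along H and W.
    out[:, :, 2 * fold:3 * fold, :-1, :] = x[:, :, 2 * fold:3 * fold, 1:, :]
    out[:, :, 3 * fold:4 * fold, :, :-1] = x[:, :, 3 * fold:4 * fold, :, 1:]

    # Remaining channels pass through unchanged.
    out[:, :, 4 * fold:] = x[:, :, 4 * fold:]
    return out


if __name__ == "__main__":
    clip = torch.randn(2, 8, 64, 56, 56)  # (batch, frames, channels, H, W)
    print(spatio_temporal_shift(clip).shape)  # torch.Size([2, 8, 64, 56, 56])
```

Because the operation only re-indexes existing activations, inserting it before a 2D convolution lets that convolution mix information across neighboring frames and spatial positions at zero additional parameter and FLOP cost, which is the property the abstract emphasizes.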