While state-of-the-art 3D Convolutional Neural Networks (CNNs) achieve very good results on action recognition datasets, they are computationally very expensive and require many GFLOPs. While the GFLOPs of a 3D CNN can be decreased by reducing the temporal feature resolution within the network, there is no setting that is optimal for all input clips. In this work, we therefore introduce a differentiable Similarity Guided Sampling (SGS) module that can be plugged into any existing 3D CNN architecture. SGS empowers 3D CNNs by learning the similarity of temporal features and grouping similar features together. As a result, the temporal feature resolution is no longer static but varies for each input video clip. By integrating SGS as an additional layer within current 3D CNNs, we can convert them into much more efficient 3D CNNs with adaptive temporal feature resolutions (ATFR). Our evaluations show that the proposed module improves the state of the art by reducing the computational cost (GFLOPs) by half while preserving or even improving accuracy. We evaluate our module by adding it to multiple state-of-the-art 3D CNNs on various datasets such as Kinetics-600, Kinetics-400, mini-Kinetics, Something-Something V2, UCF101, and HMDB51.
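The core idea of grouping temporally similar features to reduce the temporal resolution can be illustrated with a minimal sketch. The snippet below is an assumption-laden toy version, not the paper's actual SGS module (which is differentiable and trained end-to-end inside the network): it scores each frame's feature vector by cosine similarity to the clip mean, bins the scores, and averages the features within each bin, so clips with more redundant frames end up with fewer temporal positions.

```python
import numpy as np

def similarity_guided_sampling(feats, num_bins=4):
    """Toy sketch of similarity-based temporal grouping.

    feats: array of shape (T, C), one feature vector per frame.
    Returns pooled features of shape (T', C) with T' <= num_bins,
    where temporally similar frames have been averaged together.
    NOTE: illustrative only; the real SGS module is a learned,
    differentiable layer, not this hard binning.
    """
    # Cosine similarity of each frame to the clip's mean feature.
    ref = feats.mean(axis=0)
    sims = feats @ ref / (
        np.linalg.norm(feats, axis=1) * np.linalg.norm(ref) + 1e-8
    )

    # Split the observed similarity range into equal-width bins.
    edges = np.linspace(sims.min(), sims.max(), num_bins + 1)
    bin_ids = np.clip(np.digitize(sims, edges[1:-1]), 0, num_bins - 1)

    # Average the feature vectors that fall into each occupied bin;
    # empty bins are skipped, so the output length adapts per clip.
    pooled = [feats[bin_ids == b].mean(axis=0)
              for b in range(num_bins) if np.any(bin_ids == b)]
    return np.stack(pooled)
```

A clip whose frames are nearly identical collapses into very few bins, while a clip with diverse motion keeps more temporal positions, which is the adaptive-resolution behavior the abstract describes.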