While state-of-the-art 3D Convolutional Neural Networks (CNNs) achieve very good results on action recognition datasets, they are computationally very expensive and require many GFLOPs. While the GFLOPs of a 3D CNN can be decreased by reducing the temporal feature resolution within the network, there is no setting that is optimal for all input clips. In this work, we therefore introduce a differentiable Similarity Guided Sampling (SGS) module, which can be plugged into any existing 3D CNN architecture. SGS empowers 3D CNNs by learning the similarity of temporal features and grouping similar features together. As a result, the temporal feature resolution is no longer static but varies for each input video clip. By integrating SGS as an additional layer within current 3D CNNs, we can convert them into much more efficient 3D CNNs with adaptive temporal feature resolutions (ATFR). Our evaluations show that the proposed module improves the state-of-the-art by reducing the computational cost (GFLOPs) by half while preserving or even improving the accuracy. We evaluate our module by adding it to multiple state-of-the-art 3D CNNs on various datasets such as Kinetics-600, Kinetics-400, mini-Kinetics, Something-Something~V2, UCF101, and HMDB51.
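The grouping idea behind SGS can be illustrated with a toy sketch: adjacent temporal features that are highly similar are merged into one group and averaged, so the temporal length of the output adapts to the input clip. This is only an illustrative, non-differentiable approximation; the actual SGS module is learned end-to-end, and all function names and the similarity threshold below are assumptions for illustration.

```python
import math

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb + 1e-8)

def similarity_guided_sampling(feats, threshold=0.9):
    # Toy sketch of the grouping idea only: merge temporally adjacent
    # feature vectors whose cosine similarity exceeds `threshold`.
    # The real SGS module is differentiable and trained with the network.
    groups = [[feats[0]]]
    for cur in feats[1:]:
        if cosine(groups[-1][-1], cur) > threshold:
            groups[-1].append(cur)   # similar: extend the current group
        else:
            groups.append([cur])     # dissimilar: open a new group
    # Average each group -> one feature per group (adaptive temporal length).
    return [[sum(col) / len(g) for col in zip(*g)] for g in groups]

# Three near-identical frames followed by a distinct one: 4 time steps -> 2.
clip = [[1.0, 0.0], [0.99, 0.01], [1.0, 0.02], [0.0, 1.0]]
reduced = similarity_guided_sampling(clip)
```

Because the grouping depends on the content of each clip, a mostly static clip collapses to few time steps while a dynamic clip retains more, which is the source of the GFLOP savings.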