Image pre-training, the current de facto paradigm for a wide range of visual tasks, is generally less favored in the field of video recognition. By contrast, a common strategy is to train spatiotemporal convolutional neural networks (CNNs) directly from scratch. Nonetheless, interestingly, by taking a closer look at these from-scratch learned CNNs, we note that certain 3D kernels exhibit much stronger appearance modeling ability than others, arguably suggesting that appearance information is already well disentangled during learning. Inspired by this observation, we hypothesize that the key to effectively leveraging image pre-training lies in decomposing the learning of spatial and temporal features, and in revisiting image pre-training as an appearance prior for initializing 3D kernels. In addition, we propose Spatial-Temporal Separable (STS) convolution, which explicitly splits the feature channels into spatial and temporal groups, to further enable a more thorough decomposition of spatiotemporal features when fine-tuning 3D CNNs. Our experiments show that simply replacing 3D convolution with STS notably improves a wide range of 3D CNNs, without increasing parameters or computation, on both Kinetics-400 and Something-Something V2. Moreover, this new training pipeline consistently achieves better results on video recognition with significant speedup. For instance, we achieve a +0.6% top-1 improvement for SlowFast on Kinetics-400 over the strong 256-epoch 128-GPU baseline while fine-tuning for only 50 epochs with 4 GPUs. The code and models are available at https://github.com/UCSC-VLAA/Image-Pretraining-for-Video.
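The abstract describes STS convolution only at a high level: feature channels are split into a spatial group and a temporal group, keeping the overall budget close to a plain 3D convolution. The sketch below is a minimal, hypothetical PyTorch rendering of that idea, not the authors' implementation; the class name `STSConv3d`, the 50/50 split ratio, and the choice of 1×k×k versus k×1×1 factorized kernels are all assumptions for illustration.

```python
# Hypothetical sketch of a Spatial-Temporal Separable (STS) convolution,
# based only on the abstract's description of splitting feature channels
# into spatial and temporal groups. Not the authors' implementation.
import torch
import torch.nn as nn


class STSConv3d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3, spatial_ratio=0.5):
        super().__init__()
        # Split input and output channels into spatial/temporal groups
        # (the 50/50 ratio here is an assumption).
        self.in_spatial = int(in_channels * spatial_ratio)
        in_temporal = in_channels - self.in_spatial
        out_spatial = int(out_channels * spatial_ratio)
        out_temporal = out_channels - out_spatial
        k, p = kernel_size, kernel_size // 2
        # Spatial group: 1 x k x k kernels (appearance modeling); these are
        # the kernels that could be initialized from image pre-training.
        self.spatial = nn.Conv3d(self.in_spatial, out_spatial,
                                 kernel_size=(1, k, k), padding=(0, p, p))
        # Temporal group: k x 1 x 1 kernels (motion modeling).
        self.temporal = nn.Conv3d(in_temporal, out_temporal,
                                  kernel_size=(k, 1, 1), padding=(p, 0, 0))

    def forward(self, x):  # x: (N, C, T, H, W)
        xs, xt = x[:, :self.in_spatial], x[:, self.in_spatial:]
        return torch.cat([self.spatial(xs), self.temporal(xt)], dim=1)


x = torch.randn(2, 16, 8, 32, 32)  # (batch, channels, frames, height, width)
y = STSConv3d(16, 32)(x)
print(tuple(y.shape))  # (2, 32, 8, 32, 32): spatial dims preserved by padding
```

Because each group applies a factorized kernel to only part of the channels, such a layer can serve as a drop-in replacement for a full 3D convolution at a comparable parameter and FLOP budget, consistent with the abstract's "without increasing parameters and computation" claim.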