Video-based action recognition is one of the most popular topics in computer vision. With recent advances in self-supervised video representation learning, action recognition typically follows a two-stage training framework, i.e., self-supervised pre-training on a large-scale unlabeled dataset followed by transfer learning on a downstream labeled dataset. However, catastrophic forgetting of the pre-trained knowledge becomes the main issue in the downstream transfer learning stage, resulting in sub-optimal solutions. In this paper, to alleviate this issue, we propose a novel transfer learning approach that incorporates self-distillation into fine-tuning to preserve the knowledge of the model pre-trained on the large-scale dataset. Specifically, we fix the encoder from the last epoch as the teacher model to guide the training of the encoder in the current epoch of transfer learning. With this simple yet effective learning strategy, we outperform state-of-the-art methods on the widely used UCF101 and HMDB51 datasets for the action recognition task.
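To make the self-distillation fine-tuning strategy concrete, below is a minimal PyTorch sketch of the training loop described above. It is an illustrative reconstruction, not the authors' released code: the names `encoder`, `classifier`, `loader`, and the distillation weight `lambda_distill` are assumptions, and the feature-level MSE distillation term is one plausible instantiation of "the teacher guiding the current encoder".

```python
import copy
import torch
import torch.nn.functional as F

# Hypothetical sketch of fine-tuning with epoch-wise self-distillation.
# `encoder`, `classifier`, and `loader` are assumed to be provided:
# a video encoder, a linear classification head, and a labeled DataLoader.

def finetune_with_self_distillation(encoder, classifier, loader,
                                    epochs=50, lambda_distill=1.0,
                                    lr=1e-3, device="cuda"):
    encoder, classifier = encoder.to(device), classifier.to(device)
    optimizer = torch.optim.SGD(
        list(encoder.parameters()) + list(classifier.parameters()), lr=lr)
    teacher = None  # no teacher is available during the first epoch

    for epoch in range(epochs):
        encoder.train()
        classifier.train()
        for clips, labels in loader:
            clips, labels = clips.to(device), labels.to(device)
            feats = encoder(clips)
            # standard supervised loss on the downstream labels
            loss = F.cross_entropy(classifier(feats), labels)
            if teacher is not None:
                with torch.no_grad():
                    teacher_feats = teacher(clips)
                # self-distillation: pull the current encoder's features
                # toward those of the frozen encoder from the last epoch
                loss = loss + lambda_distill * F.mse_loss(feats, teacher_feats)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # freeze a copy of the current encoder as next epoch's teacher
        teacher = copy.deepcopy(encoder).eval()
        for p in teacher.parameters():
            p.requires_grad_(False)

    return encoder, classifier
```

Because the teacher is simply last epoch's frozen encoder, this scheme adds no extra pre-training stage or auxiliary network; it only requires one extra forward pass per batch to regularize fine-tuning against drifting away from the pre-trained representation.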