Transferring knowledge from task-agnostic pre-trained deep models to downstream tasks is an important topic in computer vision research. Along with the growth of computational capacity, we now have open-source vision-language pre-trained models of large scale in both model architecture and amount of data. In this study, we focus on transferring knowledge for vision classification tasks. Conventional methods randomly initialize the linear classifier head for vision classification, leaving the use of the text encoder for downstream visual recognition tasks unexplored. In this paper, we revise the role of the linear classifier and replace it with the embedded language representations of the object categories. These language representations are initialized from the text encoder of the vision-language pre-trained model, further exploiting its well-pretrained language-model parameters. The empirical study shows that our method improves both the performance and the training speed of video classification, with a negligible change to the model. In particular, our paradigm achieves state-of-the-art accuracy of 87.3% on Kinetics-400.
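The core idea above — initializing the classifier head from class-name text embeddings instead of at random — can be sketched as follows. This is a minimal illustration, not the paper's implementation: `class_embeddings` stands in for the output of a pre-trained text encoder (e.g. CLIP's text tower applied to prompts built from the category names), and the classifier scores visual features by cosine similarity against those embeddings.

```python
import numpy as np

def build_classifier(class_embeddings: np.ndarray) -> np.ndarray:
    """L2-normalize per-class text embeddings to form the weight matrix
    of the linear classifier head (shape: num_classes x feat_dim).
    In the paradigm described above, these weights come from the text
    encoder rather than random initialization."""
    norms = np.linalg.norm(class_embeddings, axis=1, keepdims=True)
    return class_embeddings / norms

def classify(visual_feat: np.ndarray, weights: np.ndarray) -> int:
    """Score a visual feature against every class embedding by cosine
    similarity and return the index of the best-matching class."""
    feat = visual_feat / np.linalg.norm(visual_feat)
    logits = weights @ feat
    return int(np.argmax(logits))

# Toy demo with random stand-in embeddings (a real system would use
# actual text-encoder outputs here).
rng = np.random.default_rng(0)
class_embeddings = rng.normal(size=(400, 512))  # e.g. 400 Kinetics classes
W = build_classifier(class_embeddings)
```

In training, `W` would be used to initialize (and optionally fine-tune) the classifier head on top of the visual backbone; the point of the paper is that this language-initialized head outperforms a randomly initialized one while adding negligible model change.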