Large-scale vision-language models (VLMs) pre-trained on billion-level data have learned general visual representations and broad visual concepts. In principle, the well-learned knowledge structure of VLMs should be appropriately inherited when transferring to downstream tasks with limited data. However, most existing efficient transfer learning (ETL) approaches for VLMs either damage the prior knowledge or are excessively biased towards it, e.g., prompt tuning (PT) discards the pre-trained text-based classifier and builds a new one, while adapter-style tuning (AT) fully relies on the pre-trained features. To address this, we propose a new efficient tuning approach for VLMs named Task Residual Tuning (TaskRes), which operates directly on the text-based classifier and explicitly decouples the prior knowledge of the pre-trained model from the new knowledge required for a target task. Specifically, TaskRes keeps the original classifier weights of the VLM frozen and obtains a new classifier for the target task by tuning a set of prior-independent parameters added as a residual to the original weights, which enables reliable preservation of prior knowledge and flexible exploration of task-specific knowledge. The proposed TaskRes is simple yet effective: it significantly outperforms previous ETL methods (e.g., PT and AT) on 11 benchmark datasets while requiring minimal implementation effort. Our code is available at https://github.com/geekyutao/TaskRes.
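To make the residual mechanism concrete, below is a minimal PyTorch sketch of the idea, assuming a CLIP-style setup in which the frozen text embeddings of the class prompts serve as the classifier weights. The class name `TaskResClassifier` and the scaling hyperparameter `alpha` are illustrative assumptions, not taken from the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TaskResClassifier(nn.Module):
    """Minimal sketch of Task Residual Tuning (TaskRes).

    `base_weights` holds the frozen text-based classifier from the
    pre-trained VLM (e.g., CLIP text embeddings of class prompts),
    with shape (num_classes, feat_dim). Only `residual` is learned.
    """
    def __init__(self, base_weights: torch.Tensor, alpha: float = 0.5):
        super().__init__()
        # Frozen prior knowledge: the pre-trained text-based classifier.
        self.register_buffer("base_weights", base_weights)
        # Prior-independent task residual, initialized to zero so that
        # training starts exactly from the pre-trained classifier.
        self.residual = nn.Parameter(torch.zeros_like(base_weights))
        self.alpha = alpha  # residual scaling factor (assumed hyperparameter)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # New task-specific classifier = frozen prior + scaled residual.
        w = self.base_weights + self.alpha * self.residual
        w = F.normalize(w, dim=-1)
        x = F.normalize(image_features, dim=-1)
        # Cosine-similarity logits (temperature scaling omitted for brevity).
        return x @ w.t()
```

Because the residual is initialized to zero and the base weights stay frozen, the model begins training at exactly the pre-trained classifier and the prior knowledge can never be overwritten, while the residual is free to move the classifier toward the target task.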