With the emergence of large pre-trained vision-language models such as CLIP, transferable representations can be adapted to a wide range of downstream tasks via prompt tuning. Prompt tuning attempts to probe the information beneficial to downstream tasks from the general knowledge stored in both the image and text encoders of the pre-trained vision-language model. A recently proposed method named Context Optimization (CoOp) introduces a set of learnable vectors as the text prompt on the language side; however, tuning the text prompt alone cannot affect the visual features computed by the image encoder, leading to sub-optimal performance. In this paper, we propose a dual-modality prompt tuning paradigm that learns text prompts and visual prompts for the text and image encoders simultaneously. In addition, to make the visual prompt concentrate more on the target visual concept, we propose Class-Aware Visual Prompt Tuning (CAVPT), in which the class-aware visual prompt is generated dynamically by performing cross attention between the language descriptions of template prompts and the visual class token embeddings. Our method provides a new paradigm for tuning large pre-trained vision-language models, and extensive experimental results on 8 datasets demonstrate the effectiveness of the proposed method. Our code is available in the supplementary materials.
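The class-aware visual prompt described above can be illustrated with a minimal cross-attention sketch. This is not the paper's implementation: the function name, shapes, and single-head formulation are illustrative assumptions; only the overall pattern (text prompt embeddings as queries attending over visual class token embeddings) follows the description in the abstract.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def class_aware_visual_prompt(text_emb, class_tokens):
    """Hypothetical sketch of CAVPT-style cross attention.

    text_emb:     (C, d) embeddings of the template-prompt language
                  descriptions, one per class (queries).
    class_tokens: (C, d) visual class token embeddings (keys/values).
    Returns a (C, d) class-aware visual prompt: each row is a
    class-description-weighted combination of the visual class tokens.
    """
    d = text_emb.shape[-1]
    # Scaled dot-product attention weights between text queries
    # and visual keys; rows sum to 1.
    attn = softmax(text_emb @ class_tokens.T / np.sqrt(d))
    # Aggregate the visual class tokens into the prompt.
    return attn @ class_tokens
```

In a real model the prompt produced this way would be concatenated with the image-encoder input sequence, so gradients from the task loss shape both modalities rather than the text side alone.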