With the emergence of large pre-trained vision-language models like CLIP, transferable representations can be adapted to a wide range of downstream tasks via prompt tuning. Prompt tuning probes the general knowledge stored in the pre-trained model for information beneficial to downstream tasks. A recently proposed method named Context Optimization (CoOp) introduces a set of learnable vectors as the text prompt on the language side. However, tuning the text prompt alone can only adjust the synthesized "classifier", while the visual features computed by the image encoder are unaffected, leading to sub-optimal solutions. In this paper, we propose a novel Dual-modality Prompt Tuning (DPT) paradigm that learns text and visual prompts simultaneously. To make the final image feature concentrate more on the target visual concept, a Class-Aware Visual Prompt Tuning (CAVPT) scheme is further proposed in our DPT, where the class-aware visual prompt is generated dynamically by performing cross attention between text prompt features and image patch token embeddings, encoding both downstream task-related information and visual instance information. Extensive experimental results on 11 datasets demonstrate the effectiveness and generalization ability of the proposed method. Our code is available at https://github.com/fanrena/DPT.
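The cross-attention step behind class-aware visual prompt generation can be sketched as follows. This is a minimal single-head illustration, not the authors' implementation: the learned projections, head counts, and tensor sizes (10 classes, 49 patches, 512 dimensions) are assumptions chosen for clarity, and text prompt features are taken as queries with patch token embeddings as keys and values.

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Scaled dot-product cross attention (single head, no learned
    projections -- a simplified stand-in for the CAVPT module)."""
    d = queries.shape[-1]
    scores = queries @ keys.swapaxes(-1, -2) / np.sqrt(d)
    # numerically stable softmax over the key (patch) axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values

# Assumed sizes: C classes, P image patches, D embedding dimension.
C, P, D = 10, 49, 512
rng = np.random.default_rng(0)
text_prompt_feats = rng.standard_normal((C, D))   # one feature per class
patch_tokens = rng.standard_normal((P, D))        # image patch embeddings

# Each class-aware visual prompt is a patch-token mixture weighted by
# its affinity to that class's text prompt feature.
class_aware_prompts = cross_attention(text_prompt_feats,
                                      patch_tokens, patch_tokens)
print(class_aware_prompts.shape)  # (10, 512)
```

The resulting prompts, one per class, would then be appended to the image encoder's input sequence so the computed visual features can adapt to the downstream task rather than staying fixed.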