Prompt learning has emerged as an alternative to fine-tuning for adapting vision-language (V-L) models to downstream tasks. Previous works mainly focus on text prompts, while visual prompting for V-L models remains underexplored. Existing visual prompt methods suffer from either mediocre performance or an unstable training process, indicating the difficulty of visual prompt learning. In this paper, we propose a new Progressive Visual Prompt (ProVP) structure to strengthen the interactions among prompts of different layers. More importantly, ProVP effectively propagates image embeddings to deep layers and behaves, in part, like an instance-adaptive prompting method. To alleviate the deterioration of generalization, we further propose a new contrastive feature re-formation technique, which prevents the prompted visual features from deviating severely from the fixed CLIP visual feature distribution. Combining both, our method (ProVP-Ref) is evaluated on 11 image benchmark datasets and achieves state-of-the-art results on 7 of the 11 datasets in both the few-shot and base-to-novel settings. To the best of our knowledge, we are the first to demonstrate that visual prompts in V-L models can outperform previous prompt-based methods on downstream tasks. Our results also show that ProVP-Ref attains the best capability to adapt and to generalize.
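To make the progressive prompt structure concrete, below is a minimal sketch in PyTorch, assuming a ViT-style visual encoder. The class name `ProgressiveVisualPrompts`, the scalar mixing weight `alpha`, and the `forward` signature are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class ProgressiveVisualPrompts(nn.Module):
    """Sketch of progressive visual prompts for a ViT-style encoder.

    Each transformer layer receives a mixture of its own learnable
    prompt tokens and the prompt tokens output by the previous layer,
    so prompt information propagates into deep layers.
    """

    def __init__(self, num_layers: int, num_prompts: int, dim: int,
                 alpha: float = 0.5):  # alpha is an assumed hyperparameter
        super().__init__()
        # one set of learnable prompt tokens per transformer layer
        self.prompts = nn.ParameterList(
            [nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
             for _ in range(num_layers)]
        )
        self.alpha = alpha  # mixing weight of the progressive connection

    def forward(self, blocks, tokens):
        # tokens: (batch, seq, dim) patch embeddings, incl. the [CLS] token
        b, seq, _ = tokens.shape
        prev = self.prompts[0].unsqueeze(0).expand(b, -1, -1)
        for i, block in enumerate(blocks):
            if i > 0:
                new = self.prompts[i].unsqueeze(0).expand(b, -1, -1)
                # progressive connection: mix this layer's prompts with the
                # prompt tokens produced by the previous layer
                prev = self.alpha * new + (1.0 - self.alpha) * prev
            out = block(torch.cat([tokens, prev], dim=1))
            tokens, prev = out[:, :seq], out[:, seq:]
        return tokens
```

Because the prompt tokens attend to the image tokens inside every block, the propagated prompts become input-dependent, which is one way to read the abstract's claim that ProVP behaves partially like an instance-adaptive prompting method.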
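The contrastive feature re-formation idea can likewise be sketched as a loss that anchors the prompted visual features to the frozen CLIP features. The sketch below, with its function name and default temperature, is an assumed illustration of the general idea: the frozen CLIP feature of each image serves as the positive for the prompted feature of the same image, and other images in the batch serve as negatives.

```python
import torch
import torch.nn.functional as F

def contrastive_reformation_loss(prompted_feat: torch.Tensor,
                                 clip_feat: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Sketch of a contrastive feature re-formation loss.

    Pulls each prompted visual feature toward the frozen CLIP feature of
    the same image and pushes it away from the frozen features of other
    images in the batch, discouraging severe drift from the pre-trained
    CLIP feature distribution.
    """
    p = F.normalize(prompted_feat, dim=-1)  # (batch, dim), prompted branch
    c = F.normalize(clip_feat, dim=-1)      # (batch, dim), frozen CLIP branch
    logits = p @ c.t() / temperature        # pairwise cosine similarities
    targets = torch.arange(p.size(0), device=p.device)  # positives on diagonal
    return F.cross_entropy(logits, targets)
```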