This paper presents a simple and effective visual prompting method for adapting pre-trained models to downstream recognition tasks. Our method includes two key designs. First, rather than directly adding together the prompt and the image, we treat the prompt as an extra and independent learnable component. We show that the strategy of reconciling the prompt and the image matters, and find that warping the prompt around a properly shrunken image empirically works the best. Second, we re-introduce two "old tricks" commonly used in building transferable adversarial examples, i.e., input diversity and gradient normalization, into visual prompting. These techniques improve optimization and enable the prompt to generalize better. We provide extensive experimental results to demonstrate the effectiveness of our method. Using a CLIP model, our prompting method sets a new record of 82.8% average accuracy across 12 popular classification datasets, substantially surpassing the prior art by +5.6%. It is worth noting that this prompting performance already outperforms linear probing by +2.1% and can even match full fine-tuning on certain datasets. In addition, our prompting method remains competitive across different data scales and under distribution shifts. The code is publicly available at https://github.com/UCSC-VLAA/EVP.
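The two ideas named above — placing a learnable prompt as a border around a shrunken image, and taking gradient-normalized update steps — can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, array shapes, and the choice of L2 normalization are hypothetical, and the real method operates inside a full training loop with a CLIP backbone.

```python
import numpy as np

def compose_prompted_input(image_small, prompt, pad):
    """Place a shrunken image at the center of a full-size learnable prompt.

    image_small: (C, H - 2*pad, W - 2*pad) — the downscaled input image.
    prompt:      (C, H, W) — an independent learnable parameter; only its
                 border of width `pad` remains visible after composition.
    """
    out = prompt.copy()
    out[:, pad:-pad, pad:-pad] = image_small
    return out

def gradient_normalized_step(prompt, grad, lr=0.1):
    """Update the prompt with the gradient rescaled to unit L2 norm,
    a trick borrowed from transferable adversarial example generation
    to stabilize the step size across iterations."""
    return prompt - lr * grad / (np.linalg.norm(grad) + 1e-12)
```

Input diversity (the other re-introduced trick) would additionally apply a random resize-and-pad transform to the composed input before each forward pass, so the learned prompt does not overfit to one fixed layout.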