We revisit and advance visual prompting (VP), an input prompting technique for vision tasks. VP can reprogram a fixed, pre-trained source model to accomplish downstream tasks in the target domain by simply incorporating universal prompts (in the form of input perturbation patterns) into downstream data points. Yet, it remains elusive why VP stays effective even under a seemingly ruleless label mapping (LM) between the source classes and the target classes. Inspired by this, we ask: How is LM interrelated with VP? And how can we exploit this relationship to improve VP's accuracy on target tasks? We peer into the influence of LM on VP and provide an affirmative answer: a better 'quality' of LM (assessed by mapping precision and explainability) consistently improves the effectiveness of VP. This is in contrast to prior art, which overlooked the role of LM. To optimize LM, we propose a new VP framework, termed ILM-VP (iterative label mapping-based visual prompting), which automatically re-maps the source labels to the target labels and progressively improves the target task accuracy of VP. Further, when using a contrastive language-image pretrained (CLIP) model, we propose to integrate an LM process to assist the text prompt selection of CLIP and to improve the target task accuracy. Extensive experiments demonstrate that our proposal significantly outperforms state-of-the-art VP methods. In particular, when reprogramming an ImageNet-pretrained ResNet-18 to 13 target tasks, our method outperforms baselines by a substantial margin, e.g., 7.9% and 6.7% accuracy improvements in transfer learning to the target Flowers102 and CIFAR100 datasets. Moreover, our CLIP-based VP proposal provides 13.7% and 7.1% accuracy improvements on Flowers102 and DTD, respectively. Our code is available at https://github.com/OPTML-Group/ILM-VP.
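To make the core idea concrete, below is a minimal PyTorch sketch of visual prompting with iterative label mapping, written from the description in this abstract rather than from the released code. The class and function names (`PaddedVisualPrompt`, `update_label_mapping`, `train_ilm_vp`), the image/prompt sizes, and the frequency-based re-mapping rule (assigning each target class to the source class most frequently predicted on its prompted images, re-computed every epoch) are illustrative assumptions; see the repository above for the authors' implementation.

```python
# Minimal sketch of ILM-VP: a learnable universal prompt (padded frame) plus an
# iteratively re-computed source-to-target label mapping. Assumptions are noted inline.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18


class PaddedVisualPrompt(nn.Module):
    """Universal input perturbation: a learnable frame padded around each resized target image."""
    def __init__(self, image_size=224, target_size=160):  # sizes are illustrative
        super().__init__()
        self.pad = (image_size - target_size) // 2
        self.target_size = target_size
        self.delta = nn.Parameter(torch.zeros(3, image_size, image_size))
        mask = torch.ones(3, image_size, image_size)
        mask[:, self.pad:self.pad + target_size, self.pad:self.pad + target_size] = 0
        self.register_buffer("mask", mask)  # the prompt lives only in the padded frame

    def forward(self, x):
        x = F.interpolate(x, size=self.target_size, mode="bilinear", align_corners=False)
        x = F.pad(x, [self.pad] * 4)
        return x + self.mask * self.delta


@torch.no_grad()
def update_label_mapping(model, prompt, loader, num_target_classes,
                         num_source_classes=1000, device="cpu"):
    """Assumed LM rule: map each target class to the source class its prompted images hit most often."""
    counts = torch.zeros(num_target_classes, num_source_classes, device=device)
    for x, y in loader:
        preds = model(prompt(x.to(device))).argmax(dim=1)
        for t, s in zip(y.to(device), preds):
            counts[t, s] += 1
    return counts.argmax(dim=1)  # mapping: target class index -> source class index


def train_ilm_vp(loader, num_target_classes, epochs=10, device="cpu"):
    model = resnet18(weights="IMAGENET1K_V1").to(device).eval()  # frozen source model
    for p in model.parameters():
        p.requires_grad_(False)
    prompt = PaddedVisualPrompt().to(device)
    opt = torch.optim.Adam(prompt.parameters(), lr=0.01)

    for _ in range(epochs):
        # Iterative LM: re-map labels with the current prompt before each training pass.
        mapping = update_label_mapping(model, prompt, loader, num_target_classes, device=device)
        for x, y in loader:
            logits = model(prompt(x.to(device)))                      # source-class logits
            loss = F.cross_entropy(logits, mapping[y.to(device)])     # supervise via mapped source labels
            opt.zero_grad()
            loss.backward()
            opt.step()
    return prompt, mapping
```

The key difference from fixed-LM visual prompting is the re-mapping step inside the epoch loop: as the prompt evolves, the label mapping is refreshed, so prompt training and LM optimization reinforce each other.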