Thanks to large pre-trained vision-language models (VLMs) like CLIP, we can craft a zero-shot classifier through "prompting": e.g., the confidence score of an image being "[CLASS]" can be obtained from the VLM-provided similarity between the image and the prompt sentence "a photo of a [CLASS]". Therefore, prompting shows great potential for fast adaptation of VLMs to downstream tasks if we fine-tune the prompt-based similarity measure. However, we find a common failure: improper fine-tuning may undermine the prompt's inherent prediction not only for the task-related classes, but also for other classes in the VLM vocabulary. Existing methods still address this problem with traditional anti-overfitting techniques such as early stopping and data augmentation, which lack a principled solution specific to prompting. We present Prompt-aligned Gradient, dubbed ProGrad, to prevent prompt tuning from forgetting the general knowledge learned from VLMs. In particular, ProGrad only updates the prompt whose gradient is aligned with (or non-conflicting to) the "general direction", which is represented as the gradient of the KL loss of the pre-defined prompt's prediction. Extensive experiments demonstrate the stronger few-shot generalization ability of ProGrad over state-of-the-art prompt tuning methods. Code is available at https://github.com/BeierZhu/Prompt-align.
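To make the prompt-based zero-shot classifier concrete, the following is a minimal sketch assuming the open-source OpenAI `clip` package; the class names, image path, and prompt template are illustrative placeholders, not part of the method itself.

```python
# Zero-shot classification via prompt-image similarity (sketch, assumes `pip install clip` from OpenAI).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["cat", "dog", "car"]                   # hypothetical label set
prompts = [f"a photo of a {c}" for c in class_names]  # hand-crafted prompt template

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)
    # cosine similarity between the image and each prompt sentence
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ text_feat.t()).softmax(dim=-1)

print({c: p.item() for c, p in zip(class_names, probs[0])})
```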
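The core update rule can be sketched as follows, assuming two precomputed loss terms: a cross-entropy loss on the tuned prompt and a KL loss against the zero-shot prediction of the pre-defined (hand-crafted) prompt. Function and variable names, the learning rate, and the plain SGD-style update are illustrative assumptions rather than the exact released implementation.

```python
# Prompt-aligned gradient update (sketch): keep only the component of the
# task gradient that does not conflict with the "general direction".
import torch

def prograd_step(prompt_param, ce_loss, kl_loss, lr=2e-3):
    # gradient of the downstream (task-specific) cross-entropy loss
    g_task = torch.autograd.grad(ce_loss, prompt_param, retain_graph=True)[0]
    # gradient of the KL loss to the zero-shot prediction: the "general direction"
    g_general = torch.autograd.grad(kl_loss, prompt_param)[0]

    g_flat, d_flat = g_task.flatten(), g_general.flatten()
    if torch.dot(g_flat, d_flat) < 0:
        # conflicting case: project out the component opposing the general direction
        g_flat = g_flat - torch.dot(g_flat, d_flat) / d_flat.norm() ** 2 * d_flat
        g_task = g_flat.view_as(g_task)

    # update the prompt with the aligned (non-conflicting) gradient
    with torch.no_grad():
        prompt_param -= lr * g_task
```

In this sketch, when the two gradients agree (non-negative inner product) the task gradient is used unchanged; only when they conflict is the update projected onto the direction orthogonal to the general-knowledge gradient, which is the alignment condition stated above.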