Prompt learning is one of the most effective and widely used ways to adapt powerful vision-language foundation models like CLIP to downstream datasets by tuning learnable prompt vectors with very few samples. However, although prompt learning achieves excellent performance on in-domain data, it still faces the major challenge of generalizing to unseen classes and domains. Some existing prompt learning methods tackle this issue by adaptively generating different prompts for different tokens or domains, but neglect the ability of the learned prompts to generalize to unseen domains. In this paper, we propose a novel prompt learning paradigm, called MetaPrompt, that directly generates a domain-invariant prompt generalizable to unseen domains. Specifically, a dual-modality prompt tuning network is proposed to generate prompts for inputs from both the image and text modalities. More importantly, we propose a meta-learning-based prompt tuning algorithm that explicitly constrains a prompt tuned on one domain or class to also achieve good performance on another domain or class. Extensive experiments on 11 datasets for base-to-new generalization and four datasets for domain generalization demonstrate that our method consistently and significantly outperforms existing methods.
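To make the meta-learning constraint concrete, the following is a minimal MAML-style sketch, not the paper's actual implementation: the prompt initialization, the stand-in loss, the synthetic domain batches, and both learning rates are hypothetical placeholders for the CLIP-based episodic training the abstract describes. The key structure is the inner update on a support domain followed by an outer loss on a held-out query domain, which explicitly penalizes prompts that do not transfer.

```python
# Minimal sketch of episodic, meta-learned prompt tuning (assumed details).
import torch

torch.manual_seed(0)

# Learnable prompt vectors (4 context tokens of width 512, sizes illustrative).
prompt = torch.nn.Parameter(0.02 * torch.randn(4, 512))
outer_opt = torch.optim.SGD([prompt], lr=1e-2)
inner_lr = 1e-2

def toy_loss(p, batch):
    """Placeholder for the prompt-conditioned CLIP loss on one domain."""
    feats, targets = batch
    scores = feats @ p.mean(dim=0)  # stand-in for prompt-based scoring
    return torch.nn.functional.mse_loss(scores, targets)

def sample_domain(n=8, d=512):
    """Synthetic stand-in for sampling a batch from one domain or class."""
    return torch.randn(n, d), torch.randn(n)

for step in range(100):
    support, query = sample_domain(), sample_domain()  # two disjoint episodes
    # Inner step: tune the prompt on the support domain, keeping the graph
    # so second-order gradients can flow through the update.
    inner_loss = toy_loss(prompt, support)
    grad, = torch.autograd.grad(inner_loss, prompt, create_graph=True)
    adapted = prompt - inner_lr * grad
    # Outer step: the adapted prompt must also perform well on the unseen
    # query domain; this is the explicit generalization constraint.
    outer_loss = toy_loss(adapted, query)
    outer_opt.zero_grad()
    outer_loss.backward()
    outer_opt.step()
```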