Prompt tuning provides an efficient mechanism to adapt large vision-language models to downstream tasks by treating part of the input language prompts as learnable parameters while freezing the rest of the model. Existing prompt tuning methods, however, are prone to damaging the generalization capabilities of the foundation model, because the learned prompts lack the capacity to cover certain concepts within the language model. To avoid this limitation, we propose a probabilistic model of the underlying distribution of prompts, allowing prompts within the support of an associated concept to be derived through stochastic sampling. This results in a more complete and richer transfer of the information captured by the language model, providing better generalization capabilities for downstream tasks. The resulting algorithm relies on a simple yet powerful variational framework that can be directly integrated with other developments. We show that our approach integrates seamlessly into both standard and conditional prompt learning frameworks, improving performance considerably in both cases, especially with regard to preserving the generalization capability of the original model. Our method sets the current state of the art for prompt learning, surpassing CoCoOp by 1.6% average Top-1 accuracy on the standard benchmark. Remarkably, it even surpasses the original CLIP model in terms of generalization to new classes. Implementation code will be released.
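The abstract describes the approach only at a high level. As a rough illustration, the sketch below shows one common way such a variational distribution over prompt tokens could be parameterized: a diagonal Gaussian over learnable prompt embeddings, sampled with the reparameterization trick and regularized toward a standard-normal prior. The class name `VariationalPrompt`, the token count, the embedding dimension, and the choice of prior are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class VariationalPrompt(nn.Module):
    """Hypothetical sketch: a Gaussian distribution over learnable
    prompt token embeddings, sampled via the reparameterization trick."""

    def __init__(self, n_tokens: int = 4, dim: int = 512):
        super().__init__()
        # Variational parameters of q(prompt): per-token mean and log-variance.
        self.mu = nn.Parameter(torch.zeros(n_tokens, dim))
        self.log_var = nn.Parameter(torch.zeros(n_tokens, dim))

    def forward(self) -> torch.Tensor:
        # Reparameterization: prompt = mu + sigma * eps, with eps ~ N(0, I),
        # so gradients flow to mu and log_var through the stochastic sample.
        eps = torch.randn_like(self.mu)
        return self.mu + torch.exp(0.5 * self.log_var) * eps

    def kl_to_standard_normal(self) -> torch.Tensor:
        # KL(q || N(0, I)), the regularizer in an ELBO-style objective:
        # 0.5 * sum(mu^2 + sigma^2 - 1 - log sigma^2).
        return 0.5 * torch.sum(
            self.mu.pow(2) + self.log_var.exp() - 1.0 - self.log_var
        )
```

Under these assumptions, each forward pass would prepend a freshly sampled set of prompt tokens to the class-name token embeddings before the frozen CLIP text encoder, and training would minimize the downstream task loss plus a weighted `kl_to_standard_normal()` term, so that different samples cover different prompts within the support of the associated concept.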