We present prompt distribution learning for effectively adapting a pre-trained vision-language model to downstream recognition tasks. Our method not only learns low-bias prompts from a few samples but also captures the distribution of diverse prompts to handle varying visual representations. In this way, we provide high-quality task-related content that facilitates recognition. This prompt distribution learning is realized by an efficient approach that learns the output embeddings of prompts instead of the input embeddings. We can therefore model them effectively with a Gaussian distribution and derive a surrogate loss for efficient training. Extensive experiments on 12 datasets demonstrate that our method consistently and significantly outperforms existing methods; for example, with 1 sample per category, it improves the average result by a relative 9.1% over human-crafted prompts.
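To make the mechanism concrete, the following is a minimal PyTorch sketch of the idea, not the authors' implementation: the `PromptDistribution` class, its diagonal-Gaussian parameterization, and the Monte-Carlo `surrogate_loss` (standing in for the closed-form surrogate the paper derives) are all illustrative assumptions, as are the hyperparameters.

```python
# Minimal sketch of prompt distribution learning over *output* prompt
# embeddings, assuming a frozen CLIP-style image encoder. All names and the
# diagonal-Gaussian / Monte-Carlo details are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptDistribution(nn.Module):
    """Models the output embeddings of per-class prompts with a Gaussian.

    Instead of optimizing token (input) embeddings through the text encoder,
    we keep a learnable mean and a diagonal std over the text-side class
    embeddings, so sampling a prompt is a cheap reparameterized draw.
    """

    def __init__(self, n_classes: int, embed_dim: int, init_mean=None):
        super().__init__()
        # The mean can be initialized from hand-crafted prompt embeddings.
        if init_mean is None:
            init_mean = torch.randn(n_classes, embed_dim) * 0.02
        self.mean = nn.Parameter(init_mean.clone())
        # Per-class diagonal covariance via log-std (an assumption here).
        self.log_std = nn.Parameter(torch.full((n_classes, embed_dim), -3.0))

    def sample(self, n_samples: int) -> torch.Tensor:
        # Reparameterized samples: (n_samples, n_classes, embed_dim).
        eps = torch.randn(n_samples, *self.mean.shape, device=self.mean.device)
        return self.mean + eps * self.log_std.exp()


def surrogate_loss(image_feats, prompt_dist, labels, n_samples=8, tau=0.01):
    """Monte-Carlo surrogate for the expected classification loss.

    Averages CLIP-style logits over sampled prompt embeddings before the
    cross-entropy: a tractable stand-in for the expectation over the prompt
    distribution (the paper instead derives a closed-form surrogate).
    """
    prompts = F.normalize(prompt_dist.sample(n_samples), dim=-1)
    image_feats = F.normalize(image_feats, dim=-1)
    # logits: (n_samples, batch, n_classes) from cosine similarities.
    logits = torch.einsum("bd,skd->sbk", image_feats, prompts) / tau
    return F.cross_entropy(logits.mean(0), labels)


# Toy usage: 5 classes, 512-d CLIP-like features, one few-shot batch.
dist = PromptDistribution(n_classes=5, embed_dim=512)
opt = torch.optim.Adam(dist.parameters(), lr=1e-3)
imgs = torch.randn(4, 512)            # stand-in for frozen image features
labels = torch.tensor([0, 1, 2, 3])
loss = surrogate_loss(imgs, dist, labels)
loss.backward()
opt.step()
```

Because only the low-dimensional output embeddings are learned, the text encoder never runs during training, which is what makes the Gaussian modeling and the sampling cheap.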