With the advent of vision-language models (VLMs) that can perform in-context and prompt-based learning, how can we design prompting approaches that generalize robustly under distribution shift and extend to novel classes outside the support set of the prompts? In this work, we first define two types of robustness of VLMs to distribution shift: robustness on base classes (the classes included in the support set of the prompts) and robustness on novel classes. We then study the robustness of existing in-context learning and prompt learning approaches, finding that prompt learning performs robustly on test images from base classes but does not generalize well to images from novel classes. We propose robust prompt learning, which integrates multi-scale image features into the prompt and improves both types of robustness. Comprehensive experiments on six benchmarks study the defined robustness and demonstrate the effectiveness of our proposal.
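To make the idea of integrating multi-scale image features into the prompt concrete, here is a minimal sketch, not the paper's exact method: it assumes CoCoOp-style conditional prompt learning, where learnable context tokens are shifted by an image-conditioned bias, and all module names, dimensions, and the three-scale feature setup are illustrative assumptions.

```python
# A minimal sketch (not the paper's exact method) of conditioning learnable
# prompt tokens on multi-scale image features, in the spirit of CoCoOp-style
# conditional prompt learning. All names and dimensions are illustrative.
import torch
import torch.nn as nn

class MultiScalePromptLearner(nn.Module):
    def __init__(self, ctx_len=4, ctx_dim=512, feat_dims=(768, 768, 768)):
        super().__init__()
        # Learnable context tokens shared across classes.
        self.ctx = nn.Parameter(torch.randn(ctx_len, ctx_dim) * 0.02)
        # One lightweight projector per feature scale (e.g., features pooled
        # from shallow, middle, and deep layers of the image encoder).
        self.projectors = nn.ModuleList(
            nn.Sequential(nn.Linear(d, ctx_dim), nn.ReLU(), nn.Linear(ctx_dim, ctx_dim))
            for d in feat_dims
        )

    def forward(self, multi_scale_feats):
        # multi_scale_feats: list of (batch, feat_dim) pooled features, one per scale.
        # Sum the projected scales into a single image-conditioned bias vector.
        bias = sum(proj(f) for proj, f in zip(self.projectors, multi_scale_feats))
        # Shift every context token by the bias, yielding one prompt per image:
        # shape (batch, ctx_len, ctx_dim).
        return self.ctx.unsqueeze(0) + bias.unsqueeze(1)

if __name__ == "__main__":
    learner = MultiScalePromptLearner()
    feats = [torch.randn(2, 768) for _ in range(3)]  # dummy features at 3 scales
    prompts = learner(feats)
    print(prompts.shape)  # torch.Size([2, 4, 512])
```

The resulting prompts would be prepended to the class-name token embeddings before the text encoder; conditioning on features from several encoder depths, rather than only the final embedding, is one plausible way such a design could help the prompt transfer to novel classes and shifted distributions.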