This paper is on soft prompt learning for Vision \& Language (V\&L) models. Similarly to their NLP counterparts, V\&L models can be adapted to a downstream task by learning soft continuous prompts from a few training examples. Current methods learn the soft prompts by minimizing a cross-entropy loss, using as class weights the features obtained by passing the prompts plus the class names through the text encoder. Such methods, however, significantly overfit the training data, suffering from large accuracy degradation when tested on unseen classes from the same domain. Our main contribution, in this paper, is a surprisingly simple approach to alleviate this problem: we use a second cross-entropy loss to minimize the distance between the learned soft prompts and a set of hand-engineered manual prompts (obtained by prompt engineering). The proposed loss can be interpreted in multiple ways: as a regularizer, as a means for language-based augmentation, and as a way of learning more discriminative class centroids. Importantly, our formulation is inherently amenable to including, during training, virtual classes, i.e. class names for which no visual samples are available, further increasing the robustness of the learned prompts. Through extensive evaluations on 11 datasets, we show that our approach (a) significantly outperforms all prior works on soft prompting, and (b) matches and surpasses, for the first time, the accuracy on novel classes obtained by hand-crafted prompts and CLIP for the majority of the test datasets. Code will be made available.
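The two-loss formulation described above can be sketched as follows. This is a minimal, hedged illustration, not the authors' implementation: the function names (`softmax_ce`, `total_loss`), the weighting factor `lam`, and the temperature `tau` are assumptions introduced here for clarity. The first term is the standard prompt-learning loss (images classified against soft-prompt class features); the second term pushes each soft-prompt class feature toward the manual-prompt feature of the same class.

```python
import numpy as np

def softmax_ce(logits, labels):
    # Numerically stable cross-entropy over the rows of `logits`.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def total_loss(img_feats, soft_feats, manual_feats, labels, lam=1.0, tau=0.07):
    # `lam` and `tau` are illustrative hyperparameters, not values from the paper.
    norm = lambda a: a / np.linalg.norm(a, axis=1, keepdims=True)
    img, soft, manual = norm(img_feats), norm(soft_feats), norm(manual_feats)
    # (1) Standard soft-prompt loss: image features vs. soft-prompt class weights.
    ce_task = softmax_ce(img @ soft.T / tau, labels)
    # (2) Proposed regularizer: each soft-prompt class feature should be
    #     closest to the manual-prompt feature of the same class.
    ce_reg = softmax_ce(soft @ manual.T / tau, np.arange(len(soft)))
    return ce_task + lam * ce_reg
```

Note that the second term requires no visual samples, which is what makes it straightforward to extend the class list with virtual classes: extra class names contribute to `ce_reg` (and to the softmax normalization) without any corresponding images.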