Vision-Language (V-L) models trained with contrastive learning to align the visual and language modalities have been shown to be strong few-shot learners. Soft prompt learning is the method of choice for few-shot downstream adaptation, aiming to bridge the modality gap caused by the distribution shift induced by the new domain. While parameter-efficient, prompt learning still requires access to the model weights and can be computationally infeasible for large models with billions of parameters. To address these shortcomings, in this work, we describe a black-box method for V-L few-shot adaptation that (a) operates on pre-computed image and text features and hence works without access to the model's weights, (b) trains orders of magnitude faster, (c) is amenable to both supervised and unsupervised training, and (d) can even be used to align image and text features computed from uni-modal models. To achieve this, we propose Linear Feature Alignment (LFA), a simple linear approach for V-L re-alignment in the target domain. LFA is initialized from a closed-form solution to a least-squares problem and then iteratively refined by minimizing a re-ranking loss. Despite its simplicity, our approach can even surpass soft-prompt learning methods, as shown by extensive experiments on 11 image and 2 video datasets.
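To make the two-stage recipe concrete, below is a minimal sketch of the idea in PyTorch: a linear map W is initialized in closed form by least squares over paired image/text features, then refined by gradient descent on a re-ranking objective. The abstract does not specify the exact loss, so the margin-based `rerank_loss` below (and all tensor shapes and hyperparameters) are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of linear feature alignment on pre-computed,
# black-box features (no access to the V-L model's weights).
import torch
import torch.nn.functional as F

def lfa_init(img_feats, txt_feats):
    """Closed-form least-squares initialization: the W minimizing
    ||img_feats @ W - txt_feats||_F^2 over paired features."""
    # torch.linalg.lstsq solves A X = B in the least-squares sense.
    return torch.linalg.lstsq(img_feats, txt_feats).solution  # (d, d)

def rerank_loss(img_feats, txt_class_feats, labels, W, margin=0.1):
    """A hypothetical margin-based re-ranking loss (an assumption):
    each aligned image feature should score higher with its own class
    text feature than with the hardest competing class."""
    aligned = F.normalize(img_feats @ W, dim=-1)
    txt = F.normalize(txt_class_feats, dim=-1)
    sims = aligned @ txt.t()                          # (n, num_classes)
    pos = sims.gather(1, labels[:, None]).squeeze(1)  # score of true class
    hard_neg = sims.scatter(1, labels[:, None], float('-inf')).max(dim=1).values
    return F.relu(margin - pos + hard_neg).mean()

# Toy usage with random stand-ins for pre-computed features.
n, d, c = 256, 128, 10
img = F.normalize(torch.randn(n, d), dim=-1)          # image features
txt_classes = F.normalize(torch.randn(c, d), dim=-1)  # per-class text features
labels = torch.randint(0, c, (n,))

W = lfa_init(img, txt_classes[labels]).clone().detach().requires_grad_(True)
opt = torch.optim.Adam([W], lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    loss = rerank_loss(img, txt_classes, labels, W)
    loss.backward()
    opt.step()
```

Because only the single matrix W is optimized over cached features, each training step is a few matrix multiplications, which is why this kind of alignment can be orders of magnitude faster than prompt tuning through the full model.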