Contrastive Language-Image Pre-training (CLIP) has emerged as a simple yet effective way to train large-scale vision-language models. CLIP demonstrates impressive zero-shot classification and retrieval on diverse downstream tasks. However, to leverage its full potential, fine-tuning still appears to be necessary. Fine-tuning the entire CLIP model can be resource-intensive and unstable. Moreover, recent methods that aim to circumvent this need for fine-tuning still require access to images from the target distribution. In this paper, we pursue a different approach and explore the regime of training-free "name-only transfer" in which the only knowledge we possess about the downstream task comprises the names of downstream target categories. We propose a novel method, SuS-X, consisting of two key building blocks -- SuS and TIP-X, that requires neither intensive fine-tuning nor costly labelled data. SuS-X achieves state-of-the-art zero-shot classification results on 19 benchmark datasets. We further show the utility of TIP-X in the training-free few-shot setting, where we again achieve state-of-the-art results over strong training-free baselines. Code is available at https://github.com/vishaal27/SuS-X.
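To make the "name-only transfer" setting concrete, the sketch below shows the standard training-free zero-shot CLIP classifier that SuS-X builds upon: the only task knowledge is the list of category names, which are turned into text prompts and compared to image embeddings by cosine similarity. This is a minimal illustration using OpenAI's open-source `clip` package; the model variant, prompt template, class names, and image path are illustrative assumptions, not details taken from the paper.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# ViT-B/32 is an illustrative backbone choice.
model, preprocess = clip.load("ViT-B/32", device=device)

# Name-only knowledge: nothing but the downstream category names.
class_names = ["golden retriever", "tabby cat", "goldfish"]  # hypothetical categories
prompts = [f"a photo of a {c}" for c in class_names]

with torch.no_grad():
    # Encode the class-name prompts once and L2-normalise them.
    text_features = model.encode_text(clip.tokenize(prompts).to(device))
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    # Classify a query image by cosine similarity to the prompt embeddings.
    image = preprocess(Image.open("query.jpg")).unsqueeze(0).to(device)
    image_features = model.encode_image(image)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)

    logits = 100.0 * image_features @ text_features.T
    pred = logits.argmax(dim=-1).item()

print(f"Predicted class: {class_names[pred]}")
```

SuS-X keeps this training-free recipe but augments it: SuS curates a support set for the target categories without using any images from the target distribution, and TIP-X uses that support set to refine the zero-shot predictions without any fine-tuning.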