Contrastive vision-language models such as CLIP have shown great progress in transfer learning. At the inference stage, a proper text description, also known as a prompt, needs to be carefully designed for the model to classify the given images correctly. To avoid laborious prompt engineering, recent works such as CoOp, CLIP-Adapter, and Tip-Adapter propose to adapt vision-language models to downstream image recognition tasks using a small set of labeled data. Though these methods achieve promising improvements, the requirement for labeled data from the target datasets may restrict their scalability. In this paper, we explore a different scenario in which no labels of the target datasets are provided, and we present an unsupervised prompt learning (UPL) approach that avoids prompt engineering while simultaneously improving the transfer performance of CLIP-like vision-language models. To the best of our knowledge, UPL is the first work to introduce unsupervised learning into prompt learning. Experimentally, UPL outperforms the original CLIP with prompt engineering on ImageNet as well as 10 other datasets. An enhanced version of UPL is even competitive with the 8-shot CoOp and the 8-shot Tip-Adapter on most datasets. Code and models are available at https://github.com/tonyhuang2022/UPL.
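To make the role of prompt engineering concrete, below is a minimal sketch of CLIP zero-shot inference using the open-source `clip` package (https://github.com/openai/CLIP). The class names, image path, and the "a photo of a {class}." template are illustrative assumptions, not part of UPL itself; the sketch only shows the hand-crafted prompting step that UPL aims to avoid.

```python
# Minimal sketch of CLIP zero-shot classification with a hand-crafted
# prompt template. Assumes `torch`, `clip`, and `Pillow` are installed.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["cat", "dog", "car"]                    # illustrative labels
prompts = [f"a photo of a {c}." for c in class_names]  # hand-crafted template
text_tokens = clip.tokenize(prompts).to(device)

# "example.jpg" is a placeholder image path.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text_tokens)
    # L2-normalize, then score the image against each prompt embedding.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

print("predicted class:", class_names[probs.argmax(dim=-1).item()])
```

Because accuracy hinges on the wording of the template string above, methods like CoOp and UPL replace it with learned prompt representations; UPL does so without target-dataset labels.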