Large pre-trained vision-language models like CLIP have shown great potential in learning representations that are transferable across a wide range of downstream tasks. Different from traditional representation learning, which is based mostly on discretized labels, vision-language pre-training aligns images and texts in a common feature space, which allows zero-shot transfer to a downstream task via prompting, i.e., classification weights are synthesized from natural language describing the classes of interest. In this work, we show that a major challenge of deploying such models in practice is prompt engineering, which requires domain expertise and is extremely time-consuming: one needs to spend a significant amount of time on word tuning, since a slight change in wording can have a huge impact on performance. Inspired by recent advances in prompt learning research in natural language processing (NLP), we propose Context Optimization (CoOp), a simple approach specifically for adapting CLIP-like vision-language models to downstream image recognition. Concretely, CoOp models a prompt's context words with learnable vectors while keeping all of the pre-trained parameters fixed. To handle different image recognition tasks, we provide two implementations of CoOp: unified context and class-specific context. Through extensive experiments on 11 datasets, we demonstrate that CoOp requires as few as one or two shots to beat hand-crafted prompts by a decent margin, and it gains significant improvements over prompt engineering with more shots, e.g., with 16 shots the average gain is around 15% (with the highest reaching over 45%). Despite being a learning-based approach, CoOp achieves superb domain generalization performance compared with the zero-shot model that uses hand-crafted prompts.
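To make the mechanism concrete, the following is a minimal PyTorch-style sketch of the idea described above, not the authors' implementation: class-name token embeddings are assumed to be precomputed, the module name PromptLearner and its arguments are illustrative, and details such as start/end-of-sequence tokens and the frozen text-encoder forward pass are omitted.

```python
import torch
import torch.nn as nn

class PromptLearner(nn.Module):
    """Learnable prompt context in the spirit of CoOp (illustrative sketch)."""

    def __init__(self, n_ctx, ctx_dim, class_token_embeds, class_specific=False):
        super().__init__()
        n_classes = class_token_embeds.size(0)
        if class_specific:
            # Class-specific context: a separate set of context vectors per class.
            ctx = torch.empty(n_classes, n_ctx, ctx_dim)
        else:
            # Unified context: one set of context vectors shared by all classes.
            ctx = torch.empty(n_ctx, ctx_dim)
        nn.init.normal_(ctx, std=0.02)
        self.ctx = nn.Parameter(ctx)  # the only trainable parameters
        # Precomputed token embeddings of the class names; stored as a frozen buffer.
        self.register_buffer("cls_embeds", class_token_embeds)

    def forward(self):
        ctx = self.ctx
        if ctx.dim() == 2:  # unified context: broadcast across all classes
            ctx = ctx.unsqueeze(0).expand(self.cls_embeds.size(0), -1, -1)
        # Each class prompt = [learnable context vectors][class-name embedding];
        # in CoOp these prompts would be passed through CLIP's frozen text encoder.
        return torch.cat([ctx, self.cls_embeds], dim=1)


# Usage sketch: 16 context tokens, 512-dim embeddings, 10 classes whose name
# embeddings (random placeholders here) each span 4 tokens.
cls_embeds = torch.randn(10, 4, 512)
learner = PromptLearner(n_ctx=16, ctx_dim=512, class_token_embeds=cls_embeds)
prompts = learner()  # shape: (10, 20, 512)
```

Under this setup, training would update only self.ctx, e.g., with a cross-entropy loss over the similarities between image features and the encoded prompts, which is what allows adaptation from as few as one or two labeled examples per class while the CLIP encoders remain untouched.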