Large pre-trained vision-language models like CLIP have shown great potential in learning representations that are transferable across a wide range of downstream tasks. Different from traditional representation learning, which is based mostly on discretized labels, vision-language pre-training aligns images and texts in a common feature space, which allows zero-shot transfer to any downstream task via \emph{prompting}, i.e., classification weights are synthesized from natural language describing the classes of interest. In this work, we show that a major challenge for deploying such models in practice is prompt engineering, which requires domain expertise and is extremely time-consuming: one needs to spend a significant amount of time tuning words, since a slight change in wording can have a huge impact on performance. Inspired by recent advances in prompt learning research in natural language processing (NLP), we propose \emph{Context Optimization (CoOp)}, a simple approach for adapting CLIP-like vision-language models to downstream image recognition. Concretely, CoOp models a prompt's context words with learnable vectors while keeping all pre-trained parameters fixed. To handle different image recognition tasks, we provide two implementations of CoOp: unified context and class-specific context. Through extensive experiments on 11 datasets, we demonstrate that CoOp requires as few as one or two shots to beat hand-crafted prompts by a decent margin, and gains significant improvements with more shots, e.g., with 16 shots the average gain is around 15\% (with the highest reaching over 45\%). Despite being a learning-based approach, CoOp achieves superb domain generalization performance compared with the zero-shot model that uses hand-crafted prompts.
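The core mechanism is compact enough to sketch in code. The following is a minimal, self-contained PyTorch illustration of CoOp-style context optimization, not the paper's implementation: the \texttt{text\_encoder} and \texttt{image\_encoder} linear layers are placeholder stand-ins for the frozen CLIP encoders, and all names, dimensions, and hyperparameters are illustrative assumptions. The sketch shows the two context variants (shared vectors in the unified case, per-class vectors in the class-specific case) prepended to fixed class-name token embeddings.

\begin{verbatim}
# Minimal sketch of CoOp-style context optimization (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptLearner(nn.Module):
    def __init__(self, class_name_embeddings, n_ctx=16, dim=512,
                 class_specific=False):
        super().__init__()
        n_cls = class_name_embeddings.shape[0]
        # Unified context: one shared set of context vectors for all classes.
        # Class-specific context: an independent set per class.
        shape = (n_cls, n_ctx, dim) if class_specific else (1, n_ctx, dim)
        self.ctx = nn.Parameter(torch.empty(shape).normal_(std=0.02))
        # Fixed (non-learnable) embeddings of the class-name tokens.
        self.register_buffer("cls_emb", class_name_embeddings)

    def forward(self):
        ctx = self.ctx.expand(self.cls_emb.shape[0], -1, -1)
        # Prompt = [learnable context vectors][class-name token embeddings]
        return torch.cat([ctx, self.cls_emb], dim=1)

# Toy usage; the linear layers stand in for frozen CLIP encoders.
n_cls, n_tok, dim = 5, 4, 512
text_encoder = nn.Linear(dim, dim)
image_encoder = nn.Linear(3 * 224 * 224, dim)
for p in list(text_encoder.parameters()) + list(image_encoder.parameters()):
    p.requires_grad_(False)  # the entire pre-trained model stays fixed

prompt_learner = PromptLearner(torch.randn(n_cls, n_tok, dim), dim=dim)
optimizer = torch.optim.SGD(prompt_learner.parameters(), lr=2e-3)

images = torch.randn(8, 3 * 224 * 224)        # a few-shot batch (flattened)
labels = torch.randint(0, n_cls, (8,))

prompts = prompt_learner()                    # (n_cls, n_ctx + n_tok, dim)
text_feat = F.normalize(text_encoder(prompts).mean(dim=1), dim=-1)
img_feat = F.normalize(image_encoder(images), dim=-1)
logits = 100.0 * img_feat @ text_feat.t()     # scaled cosine similarity
loss = F.cross_entropy(logits, labels)        # gradients reach only the context
loss.backward()
optimizer.step()
\end{verbatim}

Because only the context vectors receive gradients while both encoders stay frozen, the number of trainable parameters is tiny, which is what makes learning from as few as one or two shots per class feasible.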