Vision-language pre-training has recently emerged as a promising alternative for representation learning. It shifts from the tradition of using images and discrete labels to learn a fixed set of weights, seen as visual concepts, to aligning images and raw text using two separate encoders. Such a paradigm benefits from a broader source of supervision and allows zero-shot transfer to downstream tasks, since visual concepts can be directly generated from natural language, known as prompting. In this paper, we identify that a major challenge of deploying such models in practice is prompt engineering. This is because designing a proper prompt, especially for the context words surrounding a class name, requires domain expertise and typically takes a significant amount of time for word tuning, since a slight change in wording could have a huge impact on performance. Moreover, different downstream tasks require specific designs, further hampering the efficiency of deployment. To overcome this challenge, we propose a novel approach named Context Optimization (CoOp). The main idea is to model the context in prompts with continuous representations and perform end-to-end learning from data, while keeping the pre-trained parameters fixed. In this way, the design of task-relevant prompts can be fully automated. Experiments on 11 datasets show that CoOp effectively turns pre-trained vision-language models into data-efficient visual learners, requiring as few as one or two shots to beat hand-crafted prompts by a decent margin, and able to gain significant improvements when using more shots (e.g., at 16 shots the average gain is around 17%, with the highest reaching over 50%). CoOp also exhibits strong robustness to distribution shift.
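To make the core idea concrete, the following is a minimal, hypothetical PyTorch sketch of context optimization: a small set of continuous context vectors is prepended to frozen class-name token embeddings, passed through a frozen text encoder, and trained with a standard classification loss. The names `text_encoder`, `class_token_embeds`, and `image_features` are assumed stand-ins for components of a CLIP-like model, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextOptimizer(nn.Module):
    """Learnable prompt context shared across classes (a sketch of CoOp's unified context)."""
    def __init__(self, n_ctx, ctx_dim, class_token_embeds):
        super().__init__()
        ctx = torch.empty(n_ctx, ctx_dim)
        nn.init.normal_(ctx, std=0.02)            # randomly initialized context tokens
        self.ctx = nn.Parameter(ctx)              # the only trainable parameters
        # Frozen token embeddings of each class name: [n_cls, n_name_tokens, ctx_dim]
        self.register_buffer("cls_emb", class_token_embeds)

    def forward(self):
        n_cls = self.cls_emb.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        # Prompt layout "[V]_1 ... [V]_M [CLASS]" in embedding space
        return torch.cat([ctx, self.cls_emb], dim=1)

def train_step(prompter, text_encoder, image_features, labels, optimizer, scale=100.0):
    """One few-shot step; both encoders are frozen (requires_grad=False on their weights)."""
    text_features = F.normalize(text_encoder(prompter()), dim=-1)   # [n_cls, d] classifier weights
    image_features = F.normalize(image_features, dim=-1)            # [batch, d] from frozen image encoder
    logits = scale * image_features @ text_features.t()             # cosine-similarity logits
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()                                                  # gradients reach only the context vectors
    optimizer.step()
    return loss.item()
```

In this sketch the optimizer would be constructed over `prompter.parameters()` only, which is what keeps the pre-trained encoders fixed while the prompt context is learned end-to-end from a handful of labeled examples.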