Large-scale contrastive vision-language pre-training has shown significant progress in visual representation learning. Unlike traditional visual systems trained with a fixed set of discrete labels, a new paradigm was introduced in \cite{radford2021learning} to directly learn to align images with raw texts in an open-vocabulary setting. On downstream tasks, a carefully chosen text prompt is employed to make zero-shot predictions. To avoid non-trivial prompt engineering, context optimization \cite{zhou2021coop} has been proposed to learn continuous vectors as task-specific prompts with few-shot training examples. In this paper, we show that there is an alternative path to better vision-language models besides prompt tuning. While prompt tuning operates on the textual inputs, we propose CLIP-Adapter, which conducts fine-tuning with feature adapters on either the visual or the language branch. Specifically, CLIP-Adapter adopts an additional bottleneck layer to learn new features and performs residual-style feature blending with the original pre-trained features. As a consequence, CLIP-Adapter outperforms context optimization while maintaining a simple design. Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
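The bottleneck adapter with residual-style feature blending described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name, weight shapes, and the blending ratio \texttt{alpha} are assumptions chosen for clarity.

```python
import numpy as np

def clip_adapter(feat, W_down, W_up, alpha=0.2):
    """Hypothetical sketch of a residual-style feature adapter.

    feat:   pre-trained CLIP feature vector of dimension d
    W_down: (d, d // 4) down-projection of the bottleneck layer
    W_up:   (d // 4, d) up-projection back to the original dimension
    alpha:  blending ratio between adapted and original features
            (an illustrative value, not the paper's setting)
    """
    hidden = np.maximum(feat @ W_down, 0.0)   # bottleneck down-projection + ReLU
    adapted = np.maximum(hidden @ W_up, 0.0)  # up-projection to learn new features
    # Residual-style blending with the original pre-trained features
    return alpha * adapted + (1.0 - alpha) * feat
```

With \texttt{alpha = 0} the adapter reduces to the frozen pre-trained feature, so the blend interpolates between the original representation and the newly learned one.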