Prompt tuning has been employed as an efficient way to adapt large vision-language pre-trained models (e.g., CLIP) to various downstream tasks in data-limited or label-limited settings. Nonetheless, visual data (e.g., images) is by default a prerequisite for learning prompts in existing methods. In this work, we advocate that the image-text contrastive learning used to train CLIP aligns the two modalities well enough that texts can be treated as images for prompt tuning, and we introduce TaI prompting. In contrast to visual data, text descriptions are easy to collect, and their class labels can be directly derived. In particular, we apply TaI prompting to multi-label image recognition, where sentences in the wild serve as alternatives to images for prompt tuning. Moreover, with TaI, double-grained prompt tuning (TaI-DPT) is further presented to extract both coarse-grained and fine-grained embeddings to enhance multi-label recognition performance. Experimental results show that our proposed TaI-DPT outperforms zero-shot CLIP by a large margin on multiple benchmarks, e.g., MS-COCO, VOC2007, and NUS-WIDE, while it can be combined with existing methods that learn prompts from images to further improve recognition performance. Code is released at https://github.com/guozix/TaI-DPT.
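To make the core "texts as images" idea concrete, below is a minimal sketch of how captions could replace images when tuning CoOp-style learnable prompts against a frozen CLIP model. It assumes the open-source `clip` package (openai/CLIP) with an RN50 backbone; the class list, caption set, substring-based label derivation, and BCE loss are illustrative placeholders, not the paper's exact recipe (which uses noun filtering and a ranking loss), and only the coarse-grained (global-embedding) branch of TaI-DPT is sketched.

```python
import torch
import torch.nn.functional as F
import clip  # pip install git+https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("RN50", device=device)
model = model.float()
for p in model.parameters():       # CLIP stays frozen; only the prompt context learns
    p.requires_grad_(False)

classnames = ["dog", "person", "car"]   # hypothetical label set
n_ctx = 16                              # number of learnable context tokens
dim = model.token_embedding.embedding_dim
ctx = torch.nn.Parameter(torch.empty(n_ctx, dim, device=device).normal_(std=0.02))

# Tokenize "X X ... X <classname>." and embed it; the "X" placeholders are
# later swapped for the learnable context vectors (the CoOp construction).
prompts = [" ".join(["X"] * n_ctx) + " " + name + "." for name in classnames]
tokenized = clip.tokenize(prompts).to(device)
with torch.no_grad():
    embedded = model.token_embedding(tokenized)  # [n_cls, 77, dim]

def encode_class_prompts():
    # Rebuild the token sequence with the learnable context spliced in,
    # then run CLIP's frozen text transformer.
    prefix, suffix = embedded[:, :1], embedded[:, 1 + n_ctx:]  # SOT | name, ".", EOT, pad
    x = torch.cat([prefix, ctx.expand(len(classnames), -1, -1), suffix], dim=1)
    x = x + model.positional_embedding
    x = model.transformer(x.permute(1, 0, 2)).permute(1, 0, 2)
    x = model.ln_final(x)
    eot = tokenized.argmax(dim=-1)          # EOT has the largest token id
    return x[torch.arange(x.shape[0]), eot] @ model.text_projection

# "Texts as images": captions stand in for training images, and multi-label
# targets are read off each caption (a naive substring check here, standing
# in for the paper's noun filtering).
captions = ["a dog sits in a car", "a person walks a dog"]
targets = torch.tensor([[name in c for name in classnames] for c in captions],
                       dtype=torch.float, device=device)
with torch.no_grad():                       # frozen global caption embeddings
    caption_feat = F.normalize(
        model.encode_text(clip.tokenize(captions).to(device)), dim=-1)

optimizer = torch.optim.SGD([ctx], lr=1e-3)
for _ in range(100):
    class_feat = F.normalize(encode_class_prompts(), dim=-1)
    logits = model.logit_scale.exp() * caption_feat @ class_feat.t()
    loss = F.binary_cross_entropy_with_logits(logits, targets)  # simplified stand-in loss
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```

At test time, the learned class prompts would be compared against CLIP image embeddings instead of caption embeddings, which is exactly what the modality alignment argued above is meant to license; the fine-grained branch of TaI-DPT would additionally score per-token caption features (or per-region image features) against a second set of prompts.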