Contrastive Language-Image Pre-training (CLIP) has recently drawn increasing attention for its transferable visual representation learning. However, due to the semantic gap across datasets, CLIP's pre-trained image-text alignment becomes sub-optimal on downstream tasks, which severely harms its transfer performance. To better adapt the cross-modal embedding space, we propose to enhance CLIP with Visual-guided Texts, and name the resulting model VT-CLIP. Specifically, we guide the textual features of different categories to adaptively explore informative regions of the image and to aggregate visual features through attention mechanisms. In this way, the texts become visual-guided, that is, more semantically correlated with the downstream images, which greatly benefits the category-wise matching process. In few-shot settings, we evaluate VT-CLIP on 11 well-known classification datasets to demonstrate its effectiveness.
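To make the guidance mechanism concrete, the following is a minimal sketch (not the authors' released code) assuming the visual guidance is realized as standard cross-attention: class text embeddings act as queries over the image's patch features, and the guided text features are then matched against the global image feature. All module names, dimensions, and the temperature value are illustrative assumptions.

```python
# Hypothetical sketch of a visual-guided text module in PyTorch.
# Assumption: guidance = cross-attention with text features as queries
# and CLIP visual tokens (patch features) as keys/values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualGuidedText(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        # Text features query the image patches to gather informative regions.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_feats, patch_feats):
        # text_feats:  (num_classes, dim)  CLIP text embeddings, one per category
        # patch_feats: (num_patches, dim)  CLIP visual tokens for one image
        q = text_feats.unsqueeze(0)    # (1, num_classes, dim)
        kv = patch_feats.unsqueeze(0)  # (1, num_patches, dim)
        guided, _ = self.cross_attn(q, kv, kv)
        # Residual connection preserves the original textual semantics.
        return (q + guided).squeeze(0)  # (num_classes, dim)

def classify(image_feat, text_feats, patch_feats, module, temperature=0.01):
    # image_feat: (dim,) global image embedding from CLIP's visual encoder.
    guided_text = module(text_feats, patch_feats)
    image_feat = F.normalize(image_feat, dim=-1)
    guided_text = F.normalize(guided_text, dim=-1)
    # Category-wise matching: cosine similarity between the image feature
    # and each visual-guided text feature.
    return image_feat @ guided_text.t() / temperature  # (num_classes,)
```

In this reading, only the lightweight guidance module would be trained in the few-shot setting while the CLIP encoders stay frozen, though that training detail is an assumption rather than something stated in the abstract.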