Although significant progress has been made in few-shot learning, most existing few-shot learning methods require supervised pre-training on a large number of samples from base classes, which limits their generalization ability in real-world applications. Recently, large-scale self-supervised vision-language pre-trained models (VLPs), e.g., CLIP, have provided a new paradigm for transferable visual representation learning. However, pre-trained VLPs may neglect detailed visual information that is difficult to describe in language sentences yet important for learning an effective classifier in few-shot classification. To address this problem, we propose a new framework, named Semantic-guided Visual Adapting (SgVA), which effectively extends vision-language pre-trained models to produce discriminative, task-specific visual features by jointly using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation. The implicit knowledge distillation is designed to transfer fine-grained cross-modal knowledge to guide the updating of the vision adapter. State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
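To make the three objectives named above concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the adapter architecture, the residual ratio `alpha`, the prototype construction, the weight `lam`, and the KL-based reading of "implicit knowledge distillation" are all assumptions introduced here for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualAdapter(nn.Module):
    """Lightweight residual MLP on top of frozen CLIP image features (hypothetical design)."""
    def __init__(self, dim=512, hidden=256, alpha=0.5):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
        self.alpha = alpha  # residual mixing ratio (assumed, not from the paper)

    def forward(self, x):
        return self.alpha * self.mlp(x) + (1 - self.alpha) * x


def contrastive_logits(a, b, temperature=0.07):
    """Cosine-similarity logits between two sets of features, scaled by a temperature."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    return a @ b.t() / temperature


def sgva_losses(img_feat, txt_feat, proto_feat, labels, adapter, lam=1.0):
    """Illustrative combination of the three objectives named in the abstract.

    img_feat:   frozen CLIP image features for the query samples   [B, D]
    txt_feat:   frozen CLIP text (class-name) features             [C, D]
    proto_feat: adapted visual class prototypes from the support   [C, D]
    labels:     ground-truth class indices                         [B]
    """
    v = adapter(img_feat)  # task-specific adapted visual features

    # Cross-modal contrastive loss: align adapted image features with text features.
    cm_logits = contrastive_logits(v, txt_feat)
    loss_cm = F.cross_entropy(cm_logits, labels)

    # Vision-specific contrastive loss: pull queries toward same-class visual prototypes.
    vs_logits = contrastive_logits(v, proto_feat)
    loss_vs = F.cross_entropy(vs_logits, labels)

    # Implicit knowledge distillation (one plausible reading): let the cross-modal
    # prediction guide the purely visual prediction through a KL term.
    loss_kd = F.kl_div(F.log_softmax(vs_logits, dim=-1),
                       F.softmax(cm_logits.detach(), dim=-1),
                       reduction="batchmean")

    return loss_cm + loss_vs + lam * loss_kd
```

In this reading, only the adapter receives gradients while the CLIP encoders stay frozen, which is consistent with the abstract's claim that the adapted visual features complement, rather than replace, the cross-modal features.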