Although significant progress has been made in few-shot learning, most existing few-shot learning methods require supervised pre-training on a large number of samples from base classes, which limits their generalization ability in real-world applications. Recently, large-scale self-supervised vision-language pre-trained models (VLPs), e.g., CLIP, have provided a new paradigm for transferable visual representation learning. However, pre-trained VLPs may neglect detailed visual information that is difficult to describe in language sentences yet important for learning an effective classifier in few-shot classification. To address this problem, we propose a new framework, named Semantic-guided Visual Adapting (SgVA), which effectively extends vision-language pre-trained models to produce discriminative, task-specific visual features by jointly using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation. The implicit knowledge distillation is designed to transfer fine-grained cross-modal knowledge to guide the updating of the vision adapter. State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
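To make the three objectives named above concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the adapter architecture, the residual ratio `alpha`, the prototype construction, the weight `lam`, and the KL-based reading of "implicit knowledge distillation" are all assumptions introduced here for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualAdapter(nn.Module):
    """Lightweight residual MLP on top of frozen CLIP image features (hypothetical design)."""
    def __init__(self, dim=512, hidden=256, alpha=0.5):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
        self.alpha = alpha  # residual mixing ratio (assumed, not from the paper)

    def forward(self, x):
        return self.alpha * self.mlp(x) + (1 - self.alpha) * x


def contrastive_logits(a, b, temperature=0.07):
    """Cosine-similarity logits between two sets of features, scaled by a temperature."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    return a @ b.t() / temperature


def sgva_losses(img_feat, txt_feat, proto_feat, labels, adapter, lam=1.0):
    """Illustrative combination of the three objectives named in the abstract.

    img_feat:   frozen CLIP image features for the query samples   [B, D]
    txt_feat:   frozen CLIP text (class-name) features             [C, D]
    proto_feat: adapted visual class prototypes from the support   [C, D]
    labels:     ground-truth class indices                         [B]
    """
    v = adapter(img_feat)  # task-specific adapted visual features

    # Cross-modal contrastive loss: align adapted image features with text features.
    cm_logits = contrastive_logits(v, txt_feat)
    loss_cm = F.cross_entropy(cm_logits, labels)

    # Vision-specific contrastive loss: pull queries toward same-class visual prototypes.
    vs_logits = contrastive_logits(v, proto_feat)
    loss_vs = F.cross_entropy(vs_logits, labels)

    # Implicit knowledge distillation (one plausible reading): let the cross-modal
    # prediction guide the purely visual prediction through a KL term.
    loss_kd = F.kl_div(F.log_softmax(vs_logits, dim=-1),
                       F.softmax(cm_logits.detach(), dim=-1),
                       reduction="batchmean")

    return loss_cm + loss_vs + lam * loss_kd
```

In this reading, only the adapter receives gradients while the CLIP encoders stay frozen, which is consistent with the abstract's claim that the adapted visual features complement, rather than replace, the cross-modal features.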