Although significant progress has been made in few-shot learning, most existing few-shot image classification methods require supervised pre-training on a large number of samples from base classes, which limits their generalization ability in real-world applications. Recently, large-scale Vision-Language Pre-trained models (VLPs) have been gaining increasing attention in few-shot learning because they can provide a new paradigm for transferable visual representation learning using text that is easily available on the Web. However, VLPs may neglect fine-grained visual information that is difficult to describe in language but is important for learning an effective classifier that distinguishes different images. To address this problem, we propose a new framework, named Semantic-guided Visual Adapting (SgVA), which effectively extends vision-language pre-trained models to produce discriminative adapted visual features by jointly using implicit knowledge distillation, a vision-specific contrastive loss, and a cross-modal contrastive loss. The implicit knowledge distillation is designed to transfer fine-grained cross-modal knowledge that guides the updating of the vision adapter. State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
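To make the loss design concrete, the sketch below illustrates one plausible way to combine the three ingredients named above (a cross-modal contrastive loss, a vision-specific contrastive loss, and distillation from the frozen image-text logits) on top of frozen CLIP-style features. This is a minimal sketch, not the authors' implementation: the adapter architecture, the loss weights alpha and beta, and the temperature tau are assumptions introduced only for illustration.

```python
# Illustrative sketch of an SgVA-style objective (hypothetical names and hyperparameters).
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisionAdapter(nn.Module):
    """Hypothetical residual MLP adapter on top of frozen visual features."""

    def __init__(self, dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )

    def forward(self, v):
        # Residual adaptation followed by L2 normalization.
        return F.normalize(v + self.net(v), dim=-1)


def sgva_style_losses(v_feat, t_feat, labels, adapter, tau=0.07, alpha=1.0, beta=1.0):
    """Weighted sum of cross-modal contrastive, vision-specific contrastive,
    and distillation losses. v_feat: (B, D) frozen image features,
    t_feat: (C, D) frozen class text features, labels: (B,) class indices."""
    v_adapt = adapter(v_feat)                   # (B, D) adapted visual features
    v_froz = F.normalize(v_feat, dim=-1)
    t_feat = F.normalize(t_feat, dim=-1)

    # 1) Cross-modal contrastive loss: match adapted images to class text prototypes.
    logits_adapt = v_adapt @ t_feat.t() / tau
    loss_cross = F.cross_entropy(logits_adapt, labels)

    # 2) Vision-specific (supervised) contrastive loss among adapted image features.
    sim = v_adapt @ v_adapt.t() / tau
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(eye, -1e9)            # exclude self-similarity
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~eye
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    loss_vision = (-(log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)).mean()

    # 3) Implicit knowledge distillation: frozen image-text logits guide the adapter.
    with torch.no_grad():
        logits_frozen = v_froz @ t_feat.t() / tau
    loss_distill = F.kl_div(
        F.log_softmax(logits_adapt, dim=1),
        F.softmax(logits_frozen, dim=1),
        reduction="batchmean",
    )

    return loss_cross + alpha * loss_vision + beta * loss_distill
```

In this reading, only the adapter is trained: the cross-modal term pulls adapted images toward their class text embeddings, the vision-specific term separates images of different classes directly in the visual space, and the distillation term keeps the adapted predictions consistent with the frozen image-text similarities so that pre-trained knowledge is not forgotten.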