Few-shot classification requires deep neural networks to learn generalized representations from only a limited number of training images, which is challenging but significant in low-data regimes. Recently, CLIP-based methods have shown promising few-shot performance, benefiting from contrastive language-image pre-training. This motivates us to ask whether large-scale pre-training can alleviate the few-shot data deficiency and also assist representation learning with pre-learned knowledge. In this paper, we propose CoMo, a Collaboration of pre-trained Models that incorporates diverse prior knowledge from various pre-training paradigms for better few-shot learning. Our CoMo includes CLIP's language-contrastive knowledge, DINO's vision-contrastive knowledge, and DALL-E's language-generative knowledge. Specifically, CoMo works in two aspects: few-shot data expansion and diverse knowledge ensemble. First, we generate synthetic images via zero-shot DALL-E to enrich the few-shot training data without any manual effort. Second, we introduce a learnable Multi-Knowledge Adapter (MK-Adapter) to adaptively blend the predictions from CLIP and DINO. Through such collaboration, CoMo fully unleashes the potential of different pre-training methods and unifies them to achieve state-of-the-art performance on few-shot classification. We conduct extensive experiments on 11 datasets to demonstrate the superiority and generalization ability of our approach.
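The abstract does not specify how the MK-Adapter blends CLIP and DINO predictions. Below is a minimal PyTorch sketch of one plausible design, assuming a cache-model-style adapter (in the spirit of Tip-Adapter) with learnable keys initialized from few-shot features and a learnable scalar gate between the two knowledge sources. All names (`MKAdapter`, `alpha`, `beta`) and the sigmoid-gated blending are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MKAdapter(nn.Module):
    """Sketch of a multi-knowledge adapter that blends cache-based
    predictions from two feature extractors (CLIP and DINO) and adds
    CLIP's zero-shot logits as a prior. Hypothetical layout."""

    def __init__(self, clip_dim, dino_dim, cache_size, num_classes, beta=5.0):
        super().__init__()
        # Learnable cache keys; in practice initialized from the few-shot
        # training features, randomly initialized here for self-containment.
        self.clip_keys = nn.Parameter(torch.randn(cache_size, clip_dim))
        self.dino_keys = nn.Parameter(torch.randn(cache_size, dino_dim))
        # One-hot labels of the cached training samples (fixed, not learned).
        labels = torch.arange(cache_size) % num_classes
        self.register_buffer("values", F.one_hot(labels, num_classes).float())
        self.beta = beta  # sharpness of the cache affinities
        # Learnable scalar gating the two knowledge sources.
        self.alpha = nn.Parameter(torch.tensor(0.0))

    def forward(self, clip_feat, dino_feat, zero_shot_logits):
        # Cosine similarity to cached keys, sharpened as in cache-model
        # adapters: exp(-beta * (1 - sim)).
        sim_clip = F.normalize(clip_feat, dim=-1) @ F.normalize(self.clip_keys, dim=-1).t()
        sim_dino = F.normalize(dino_feat, dim=-1) @ F.normalize(self.dino_keys, dim=-1).t()
        a_clip = torch.exp(-self.beta * (1.0 - sim_clip))
        a_dino = torch.exp(-self.beta * (1.0 - sim_dino))
        # Adaptively blend the two cache predictions, then add the
        # zero-shot prior from CLIP's text classifier.
        w = torch.sigmoid(self.alpha)
        cache_logits = (w * a_clip + (1.0 - w) * a_dino) @ self.values
        return zero_shot_logits + cache_logits

# Hypothetical usage: 16 shots x 100 classes in the cache,
# CLIP/DINO feature dims 512/768, batch of 8 test images.
adapter = MKAdapter(clip_dim=512, dino_dim=768, cache_size=1600, num_classes=100)
logits = adapter(torch.randn(8, 512), torch.randn(8, 768), torch.randn(8, 100))
```

Under this reading, only the keys and the gate are trained on the few-shot set (optionally augmented with DALL-E synthetic images), while both backbones and the cached one-hot values stay frozen, which keeps the number of learnable parameters small in the low-data regime.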