Visual recognition in low-data regimes requires deep neural networks to learn generalized representations from limited training samples. Recently, CLIP-based methods have shown promising few-shot performance, benefiting from contrastive language-image pre-training. We then ask whether more diverse pre-training knowledge can be cascaded to further assist few-shot representation learning. In this paper, we propose CaFo, a Cascade of Foundation models that incorporates diverse prior knowledge from various pre-training paradigms for better few-shot learning. CaFo combines CLIP's language-contrastive knowledge, DINO's vision-contrastive knowledge, DALL-E's vision-generative knowledge, and GPT-3's language-generative knowledge. Specifically, CaFo works by 'Prompt, Generate, then Cache'. First, we leverage GPT-3 to produce textual inputs that prompt CLIP with rich downstream linguistic semantics. Then, we generate synthetic images via DALL-E to expand the few-shot training data without any manpower. Finally, we introduce a learnable cache model to adaptively blend the predictions from CLIP and DINO. Through this collaboration, CaFo fully unleashes the potential of different pre-training methods and unifies them to achieve state-of-the-art performance on few-shot classification. Code is available at https://github.com/ZrrSkywalker/CaFo.
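To make the 'Cache' step concrete, the following is a minimal sketch of how a key-value cache model can blend CLIP's zero-shot predictions with few-shot knowledge, in the style of cache-based adapters such as Tip-Adapter; the function name, tensor shapes, and the hyperparameters `beta` and `alpha` are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def cache_blended_logits(test_feat, cache_keys, cache_values, clip_logits,
                         beta=5.5, alpha=1.0):
    """Blend zero-shot CLIP logits with a key-value cache built from
    few-shot training features (a hedged sketch; `beta`/`alpha` are
    placeholder values, not the paper's tuned hyperparameters).

    test_feat:    (B, D) L2-normalized test image features
    cache_keys:   (N, D) L2-normalized few-shot training features
    cache_values: (N, C) one-hot labels of the few-shot samples
    clip_logits:  (B, C) zero-shot logits from CLIP's text classifier
    """
    # Cosine affinity between test features and cached keys.
    affinity = test_feat @ cache_keys.t()                               # (B, N)
    # Sharpened, non-negative similarity, as in cache-based adapters.
    cache_logits = torch.exp(-beta * (1.0 - affinity)) @ cache_values  # (B, C)
    # Residual blend of CLIP's prior and the cached few-shot knowledge.
    return clip_logits + alpha * cache_logits
```

In CaFo, the cache keys can be made learnable and the blending extends to adaptively weight predictions from both CLIP and DINO, but the residual structure above captures the core idea.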