Large Vision-Language Foundation Models (VLFMs), such as CLIP, ALIGN, and Florence, are trained on large-scale datasets of image-caption pairs and achieve superior transferability and robustness on downstream tasks, but they are difficult to use in many practical applications due to their large size, high latency, and fixed architectures. Unfortunately, recent work shows that training a small, custom VLFM for resource-limited applications is currently very difficult using public, smaller-scale data. In this paper, we introduce a new distillation mechanism (DIME-FM) that transfers the knowledge contained in large VLFMs to smaller, customized foundation models using a relatively small amount of inexpensive, unpaired images and sentences. We transfer the knowledge from the pre-trained CLIP-ViT-L/14 model to a ViT-B/32 model, using only 40M public images and 28.4M unpaired public sentences. The resulting model, "Distill-ViT-B/32", rivals the CLIP-ViT-B/32 model pre-trained on the private WiT dataset (400M image-text pairs): Distill-ViT-B/32 achieves similar zero-shot and linear-probing performance on both ImageNet and the ELEVATER benchmark (20 image classification tasks). It also displays comparable robustness when evaluated on five datasets with natural distribution shifts from ImageNet.
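As a rough illustration of the kind of unpaired distillation the abstract describes, the sketch below trains a student image encoder so that its image-to-text similarity distribution over a bank of teacher text embeddings (computed from unpaired sentences) matches the teacher's. This is a minimal sketch under assumed design choices: the function and tensor names, the KL-based objective, and the temperature are illustrative placeholders, not the paper's exact loss formulation.

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher_img_enc, student_img_enc, text_emb_bank,
                      images, temperature=0.07):
    """Hypothetical single distillation step (names and loss are assumptions).

    teacher_img_enc: frozen large image encoder (e.g., CLIP-ViT-L/14 vision tower)
    student_img_enc: small trainable image encoder (e.g., ViT-B/32)
    text_emb_bank:   precomputed teacher text embeddings from unpaired sentences
    images:          a batch of unlabeled images
    """
    with torch.no_grad():
        t_img = F.normalize(teacher_img_enc(images), dim=-1)   # teacher image features
    s_img = F.normalize(student_img_enc(images), dim=-1)       # student image features
    txt = F.normalize(text_emb_bank, dim=-1)                   # teacher text features

    # Image-to-text similarity logits for teacher and student.
    t_logits = t_img @ txt.t() / temperature
    s_logits = s_img @ txt.t() / temperature

    # Match the student's softmax distribution over texts to the teacher's.
    loss = F.kl_div(F.log_softmax(s_logits, dim=-1),
                    F.softmax(t_logits, dim=-1),
                    reduction="batchmean")
    return loss
```

Because the images and sentences never need to be paired, the text embedding bank can be built once from any public text corpus, which is what makes this style of distillation cheap relative to collecting captioned images.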