Vision-language models trained with contrastive learning on large-scale noisy data are becoming increasingly popular for zero-shot recognition problems. In this paper we improve three aspects of the contrastive pre-training pipeline: dataset noise, model initialization, and the training objective. First, we propose a straightforward filtering strategy titled Complexity, Action, and Text-spotting (CAT) that significantly reduces dataset size while achieving improved performance across zero-shot vision-language tasks. Next, we propose an approach titled Concept Distillation that leverages strong unimodal representations for contrastive training without increasing training complexity, while outperforming prior work. Finally, we modify the traditional contrastive alignment objective with an importance-sampling scheme that up-weights hard negatives without adding additional complexity. On an extensive zero-shot benchmark of 29 tasks, our Distilled and Hard-negative Training (DiHT) approach improves over the baseline on 20 tasks. Furthermore, for few-shot linear probing, we propose a novel approach that bridges the gap between zero-shot and few-shot performance, substantially improving over prior work. Models are available at https://github.com/facebookresearch/diht.
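To make the hard-negative up-weighting idea concrete, the PyTorch sketch below re-weights the negative terms of a standard CLIP-style InfoNCE loss by an importance factor that grows with negative similarity. This is a minimal illustration under my own assumptions: the function name `hard_negative_contrastive_loss`, the `beta` hyper-parameter, the softmax-based weighting, and the stop-gradient on the weights are illustrative choices, not the released DiHT implementation.

```python
import torch
import torch.nn.functional as F


def hard_negative_contrastive_loss(image_emb, text_emb, temperature=0.07, beta=0.5):
    """Sketch of an InfoNCE-style contrastive loss whose negative terms are
    re-weighted so that harder negatives (higher similarity) contribute more.
    Illustrative approximation only, not the exact DiHT objective."""
    # Cosine-normalize embeddings and compute the pairwise similarity logits.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature          # (B, B)
    batch_size = logits.size(0)
    pos_mask = torch.eye(batch_size, dtype=torch.bool, device=logits.device)

    def directional_loss(sim):
        pos = sim[pos_mask]                                   # positive logits, shape (B,)
        neg = sim.masked_fill(pos_mask, float("-inf"))        # negatives only
        # Importance weights: up-weight harder (more similar) negatives.
        # beta controls how sharply hard negatives are emphasized; weights are
        # treated as constants (no gradient) and scaled to sum to B - 1 so that
        # beta -> 0 recovers the uniform (standard InfoNCE) weighting.
        with torch.no_grad():
            w = torch.softmax(beta * neg, dim=1) * (batch_size - 1)
        neg_term = torch.logsumexp(neg + torch.log(w.clamp_min(1e-12)), dim=1)
        denom = torch.logsumexp(torch.stack([pos, neg_term], dim=0), dim=0)
        return (denom - pos).mean()

    # Symmetric loss over image->text and text->image directions.
    return 0.5 * (directional_loss(logits) + directional_loss(logits.t()))
```

Used in place of the usual symmetric cross-entropy on the similarity matrix, this keeps the per-batch cost identical to standard contrastive training; only the constant re-weighting of negatives changes, which is what the abstract means by adding no additional complexity.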