Contrastively trained image-text models such as CLIP, ALIGN, and BASIC have demonstrated unprecedented robustness to multiple challenging natural distribution shifts. Since these image-text models differ from previous training approaches in several ways, an important question is what causes the large robustness gains. We answer this question via a systematic experimental investigation. Concretely, we study five different possible causes for the robustness gains: (i) the training set size, (ii) the training distribution, (iii) language supervision at training time, (iv) language supervision at test time, and (v) the contrastive loss function. Our experiments show that the more diverse training distribution is the main cause for the robustness gains, with the other factors contributing little to no robustness. Beyond our experimental results, we also introduce ImageNet-Captions, a version of ImageNet with original text annotations from Flickr, to enable further controlled experiments of language-image training.
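Among the five candidate factors, the contrastive loss function is the one most easily made concrete. Below is a minimal NumPy sketch of the symmetric contrastive (InfoNCE-style) objective that CLIP-style models optimize: embed a batch of images and their captions, compute all pairwise cosine similarities, and apply cross-entropy in both directions so each image matches its own caption and vice versa. Function names and the temperature value here are illustrative, not taken from any particular implementation.

```python
import numpy as np

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired image/text embeddings.

    image_embs, text_embs: arrays of shape (batch, dim); row i of each is a pair.
    """
    # Normalize embeddings to unit length so the dot product is cosine similarity.
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by temperature; entry (i, j) compares
    # image i with caption j. The diagonal holds the matched pairs.
    logits = image_embs @ text_embs.T / temperature
    n = logits.shape[0]
    labels = np.arange(n)

    def cross_entropy(l):
        # Numerically stable log-softmax over each row, then pick the
        # log-probability of the correct (diagonal) match.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

As a sanity check, the loss should be lower when each image's embedding is aligned with its own caption's embedding than when the pairing is scrambled; this is the signal that drives the representations studied in the abstract.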