Contrastively trained language-image models such as CLIP, ALIGN, and BASIC have demonstrated unprecedented robustness to multiple challenging natural distribution shifts. Since these language-image models differ from previous training approaches in several ways, an important question is what causes the large robustness gains. We answer this question via a systematic experimental investigation. Concretely, we study five different possible causes for the robustness gains: (i) the training set size, (ii) the training distribution, (iii) language supervision at training time, (iv) language supervision at test time, and (v) the contrastive loss function. Our experiments show that the more diverse training distribution is the main cause for the robustness gains, with the other factors contributing little to no robustness. Beyond our experimental results, we also introduce ImageNet-Captions, a version of ImageNet with original text annotations from Flickr, to enable further controlled experiments of language-image training.
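For readers unfamiliar with factor (v), the following is a minimal sketch of the CLIP-style symmetric contrastive loss that models like CLIP, ALIGN, and BASIC train with. This is an illustrative NumPy reimplementation, not code from any of these papers: image and text embeddings are L2-normalized, matching pairs share an index, and each modality is classified against the other via cross-entropy over scaled cosine similarities. The function name and `temperature` default are assumptions for illustration.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric image-text contrastive loss (illustrative sketch).

    img_emb, txt_emb: arrays of shape (N, d); row i of each is a matching pair.
    """
    # Normalize embeddings to unit length so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by temperature; matching pairs
    # sit on the diagonal.
    logits = img @ txt.T / temperature  # shape (N, N)
    labels = np.arange(len(img))

    def cross_entropy(l):
        # Log-softmax with max subtraction for numerical stability.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

The loss is low when each image embedding is most similar to its own caption's embedding and dissimilar to all other captions in the batch, which is what pushes the two modalities into a shared embedding space.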