Web-crawled datasets have enabled remarkable generalization capabilities in recent image-text models such as CLIP (Contrastive Language-Image Pre-training) and Flamingo, but little is known about the dataset creation processes. In this work, we introduce a testbed of six publicly available data sources (YFCC, LAION, Conceptual Captions, WIT, RedCaps, and Shutterstock) to investigate how pre-training distributions induce robustness in CLIP. We find that the performance of the pre-training data varies substantially across distribution shifts, with no single data source dominating. Moreover, we systematically study the interactions between these data sources and find that combining multiple sources does not necessarily yield better models, but rather dilutes the robustness of the best individual data source. We complement our empirical findings with theoretical insights from a simple setting, where combining the training data also results in diluted robustness. In addition, our theoretical model provides a candidate explanation for the success of the CLIP-based data filtering technique recently employed in the LAION dataset. Overall, our results demonstrate that simply gathering a large amount of data from the web is not the most effective way to build a pre-training dataset for robust generalization, necessitating further study into dataset design. Code is available at https://github.com/mlfoundations/clip_quality_not_quantity.