The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021), on most of the benchmarks at image and pixel levels.
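To make the distillation step concrete, below is a minimal sketch of feature-level knowledge distillation from a large frozen teacher into a smaller student. This is not the paper's actual objective or architecture: the `TinyViTStub` module, the projection head, the dimensions, and the simple cosine feature-matching loss are all illustrative assumptions standing in for the real ViT backbones and training losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyViTStub(nn.Module):
    """Hypothetical stand-in for a ViT backbone that returns a global image feature."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(3 * 16 * 16, dim)  # toy patch embedding only, no transformer blocks

    def forward(self, x):
        # x: (B, 3, 224, 224) -> 16x16 patches -> linear embed -> mean-pool to a global feature
        patches = x.unfold(2, 16, 16).unfold(3, 16, 16)         # (B, 3, 14, 14, 16, 16)
        patches = patches.permute(0, 2, 3, 1, 4, 5).flatten(3)  # (B, 14, 14, 768)
        tokens = self.proj(patches.flatten(1, 2))               # (B, 196, dim)
        return tokens.mean(dim=1)                               # (B, dim)


# Frozen "large" teacher (stand-in for the 1B-parameter ViT) and a smaller student.
teacher = TinyViTStub(dim=1024).eval()
student = TinyViTStub(dim=384)
head = nn.Linear(384, 1024)  # projects student features into the teacher's feature space

optimizer = torch.optim.AdamW(list(student.parameters()) + list(head.parameters()), lr=1e-4)

images = torch.randn(8, 3, 224, 224)  # placeholder batch
with torch.no_grad():
    target = F.normalize(teacher(images), dim=-1)  # teacher features, no gradient

pred = F.normalize(head(student(images)), dim=-1)
loss = (1 - (pred * target).sum(dim=-1)).mean()  # cosine-distance distillation loss
loss.backward()
optimizer.step()
```

The design point this sketch illustrates is that the student only needs access to the teacher's output features, not its labels or training data pipeline, which is what allows a series of smaller models to inherit the large model's all-purpose representations.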