Most recent self-supervised learning methods are pre-trained on the well-curated ImageNet-1K dataset. In this work, given the excellent scalability of web data, we consider self-supervised pre-training on noisy, web-sourced image-text paired data. First, we conduct a benchmark study of representative self-supervised pre-training methods on large-scale web data in a like-for-like setting. We compare a range of methods, including single-modal ones that use masked training objectives and multi-modal ones that use image-text contrastive training. We observe that existing multi-modal methods do not outperform their single-modal counterparts on vision transfer learning tasks. We derive an information-theoretic view to explain these benchmark results, which provides insight into how to design a novel vision learner. Inspired by this insight, we present a new visual representation pre-training method, MUlti-modal Generator~(MUG), that learns from scalable web-sourced image-text data. MUG achieves state-of-the-art transfer performance on a variety of tasks and demonstrates promising scaling properties. Pre-trained models and code will be made public upon acceptance.
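For readers unfamiliar with the multi-modal baseline family mentioned above, the sketch below shows a standard CLIP-style symmetric image-text contrastive loss, the kind of objective benchmarked against masked single-modal training. This is an illustrative example only, not the paper's implementation; the function name, the `temperature` value, and the assumption that paired image and text embeddings are already computed are all ours.

```python
import torch
import torch.nn.functional as F


def image_text_contrastive_loss(image_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: tensors of shape (batch, dim), where row i of each
    tensor comes from the same image-text pair.
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity logits; matching pairs lie on the diagonal.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image-to-text and text-to-image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```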