Most recent self-supervised learning~(SSL) methods are pre-trained on the well-curated ImageNet-1K dataset. In this work, we consider SSL pre-training on noisy web image-text paired data, motivated by the excellent scalability of web data. First, we conduct a benchmark study of representative SSL pre-training methods on large-scale web data under fair conditions. The methods include single-modal ones such as MAE and multi-modal ones such as CLIP. We observe that the multi-modal methods cannot outperform the single-modal ones on vision transfer learning tasks. We derive an information-theoretical view to explain the benchmarking results, which provides insight into designing novel vision learners. Inspired by these explorations, we present a visual representation pre-training method, MUlti-modal Generator~(MUG), that learns from scalable web image-text data. MUG achieves state-of-the-art transfer performance on a variety of tasks and exhibits promising scaling behavior. Models and code will be made public. Demo available at https://huggingface.co/spaces/tennant/MUG_caption
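The abstract does not spell out MUG's architecture or training objective; purely as an illustration of what a multi-modal generative pre-training objective on web image-text pairs can look like, the following is a minimal, hypothetical PyTorch sketch in which a vision encoder is pre-trained by captioning images with a small text decoder. All module names, sizes, and design choices here are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn


class MultiModalGenerativePretrainer(nn.Module):
    """Hypothetical sketch: a vision encoder produces patch features, and a
    lightweight text decoder is trained to generate the paired web caption
    from those features (next-token prediction conditioned on the image)."""

    def __init__(self, vocab_size=30522, dim=768, num_patches=196):
        super().__init__()
        # Vision encoder: a plain Transformer encoder over flattened patch
        # embeddings (a stand-in for a ViT backbone).
        self.patch_embed = nn.Linear(3 * 16 * 16, dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=12)
        # Text decoder: cross-attends to visual features, predicts caption tokens.
        self.token_embed = nn.Embedding(vocab_size, dim)
        dec_layer = nn.TransformerDecoderLayer(dim, nhead=12, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=4)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, patches, caption_ids):
        # patches: (B, N, 3*16*16) flattened image patches
        # caption_ids: (B, T) tokenized web caption
        visual = self.encoder(self.patch_embed(patches) + self.pos_embed)
        tgt = self.token_embed(caption_ids[:, :-1])  # teacher forcing
        causal_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        dec = self.decoder(tgt, visual, tgt_mask=causal_mask)
        logits = self.lm_head(dec)
        # Caption generation loss, conditioned on the image.
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            caption_ids[:, 1:].reshape(-1),
        )


# Toy usage with random data, to show the training signal end to end.
model = MultiModalGenerativePretrainer()
patches = torch.randn(2, 196, 3 * 16 * 16)
caption_ids = torch.randint(0, 30522, (2, 32))
loss = model(patches, caption_ids)
loss.backward()
```

In such a setup, only the vision encoder would typically be kept for downstream transfer, while the text decoder serves as a training-time head that turns noisy web captions into a generative supervision signal.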