We present Fast Language-Image Pre-training (FLIP), a simple and more efficient method for training CLIP. Our method randomly masks out and removes a large portion of image patches during training. Masking allows us to learn from more image-text pairs given the same wall-clock time and to contrast more samples per iteration with a similar memory footprint. It leads to a favorable trade-off between accuracy and training time. In our experiments on 400 million image-text pairs, FLIP improves both accuracy and speed over the no-masking baseline. On a large diversity of downstream tasks, FLIP dominantly outperforms the CLIP counterparts trained on the same data. Facilitated by the speedup, we explore the scaling behavior of increasing the model size, data size, or training length, and report encouraging results and comparisons. We hope that our work will foster future research on scaling vision-language learning.
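As an illustration of the masking step described above, the following is a minimal sketch (not the authors' code) of MAE-style random patch dropping, assuming a ViT-style patch embedding; all names such as `random_mask_patches` and `mask_ratio` are hypothetical.

```python
# Illustrative sketch: randomly drop a large fraction of image patch tokens
# before they reach the image encoder, as described in the abstract.
import torch

def random_mask_patches(patch_tokens: torch.Tensor, mask_ratio: float = 0.5):
    """Randomly remove a fraction of patch tokens per image.

    patch_tokens: (batch, num_patches, dim) patch embeddings.
    Returns the kept tokens (batch, num_kept, dim) and their indices.
    """
    b, n, d = patch_tokens.shape
    num_keep = int(n * (1.0 - mask_ratio))
    # Independent random permutation of patch indices for each image.
    noise = torch.rand(b, n, device=patch_tokens.device)
    ids_shuffle = noise.argsort(dim=1)
    ids_keep = ids_shuffle[:, :num_keep]
    kept = torch.gather(
        patch_tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d)
    )
    return kept, ids_keep
```

Only the kept tokens are then passed to the image encoder, so the per-image compute drops roughly in proportion to the mask ratio, which is what allows larger batches or more image-text pairs within the same wall-clock time and memory budget.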