Self-supervision and natural language supervision have emerged as two exciting ways to train general-purpose image encoders that excel at a variety of downstream tasks. Recent works such as M3AE and SLIP have suggested that these approaches can be effectively combined, but notably their results use small pre-training datasets (<50M samples) and do not reflect the large-scale regime (>100M samples) in which these approaches are commonly applied. Here we investigate whether a similar approach can be effective when trained with a much larger amount of data. We find that a combination of two state-of-the-art approaches, masked auto-encoders (MAE) and contrastive language-image pre-training (CLIP), provides a benefit over CLIP alone when trained on a corpus of 11.3M image-text pairs, but little to no benefit (as evaluated on a suite of common vision tasks) when trained on a large corpus of 1.4B images. Our work provides some much-needed clarity into the effectiveness (or lack thereof) of self-supervision for large-scale image-text training.
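To make the combined objective concrete, the sketch below shows one way a CLIP-style contrastive loss and an MAE-style masked reconstruction loss might be summed over a shared image encoder. This is a minimal illustration, not the authors' implementation; the function name `combined_loss` and the parameters `temperature` and `mae_weight` are assumptions introduced here for clarity.

```python
# Minimal sketch (not the authors' code) of combining a CLIP-style contrastive
# loss with an MAE-style reconstruction loss. Names and weights are illustrative.
import torch
import torch.nn.functional as F

def combined_loss(image_emb, text_emb, pred_patches, target_patches, mask,
                  temperature=0.07, mae_weight=1.0):
    """image_emb, text_emb: (B, D) projected embeddings for the contrastive loss.
    pred_patches, target_patches: (B, N, P) decoder outputs and pixel targets.
    mask: (B, N) binary mask, 1 for masked patches (MAE loss is computed there).
    """
    # CLIP-style symmetric InfoNCE loss over the batch.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    clip_loss = 0.5 * (F.cross_entropy(logits, labels) +
                       F.cross_entropy(logits.t(), labels))

    # MAE-style mean-squared reconstruction loss, averaged over masked patches only.
    per_patch = ((pred_patches - target_patches) ** 2).mean(dim=-1)
    mae_loss = (per_patch * mask).sum() / mask.sum().clamp(min=1)

    return clip_loss + mae_weight * mae_loss
```

In such a setup, `mae_weight` would control the relative contribution of the reconstruction objective; the experiments summarized above compare training with and without this self-supervised term at different data scales.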