The availability of large-scale image captioning and visual question answering datasets has contributed significantly to recent successes in vision-and-language pre-training. However, these datasets are often collected with overly restrictive requirements inherited from their original target tasks (e.g., image caption generation), which limit the resulting dataset scale and diversity. We take a step further in pushing the limits of vision-and-language pre-training data by relaxing the data collection pipeline used in Conceptual Captions 3M (CC3M) [Sharma et al. 2018] and introduce Conceptual 12M (CC12M), a dataset of 12 million image-text pairs specifically meant to be used for vision-and-language pre-training. We perform an analysis of this dataset and benchmark its effectiveness against CC3M on multiple downstream tasks, with an emphasis on long-tail visual recognition. Our results clearly illustrate the benefit of scaling up pre-training data for vision-and-language tasks, as indicated by the new state-of-the-art results on both the nocaps and Conceptual Captions benchmarks.