Compared with domain-specific models, vision-language pre-training models (VLPMs) have shown superior performance on downstream tasks with a fast fine-tuning process. For example, ERNIE-ViL, Oscar, and UNIMO trained VLPMs with a unified Transformer-stack architecture and large amounts of image-text paired data, achieving remarkable results on downstream tasks such as image-text retrieval (IR and TR), visual question answering (VQA), and image captioning (IC). During the training phase, VLPMs are usually fed a combination of multiple public datasets to meet the demand for large-scale training data. However, due to the uneven distribution of the data in terms of size, task type, and quality, using a mixture of multiple datasets for model training can be problematic. In this work, we introduce a large-scale multi-modal corpus named WuDaoMM, containing more than 650 million image-text pairs in total. Specifically, about 600 million pairs are collected from webpages in which the image and its caption are only weakly correlated, and the other 50 million strongly correlated image-text pairs are collected from high-quality graphic websites. We also release a base version of WuDaoMM with 5 million strongly correlated image-text pairs, which is sufficient to support common cross-modal model pre-training. In addition, we trained both an understanding and a generation vision-language (VL) model to test the effectiveness of the dataset. The results show that WuDaoMM can serve as an efficient dataset for VLPMs, especially for models on the text-to-image generation task. The data is released at https://data.wudaoai.cn.