WuDaoMM: A large-scale Multi-Modal Dataset for Pre-training models

Compared with domain-specific models, vision-language pre-training models (VLPMs) have shown superior performance on downstream tasks with a fast fine-tuning process. For example, ERNIE-ViL, Oscar and UNIMO trained VLPMs with a uniform transformer stack architecture and large amounts of image-text paired data, achieving remarkable results on downstream tasks such as image-text retrieval (IR and TR), visual question answering (VQA) and image captioning (IC). During the training phase, VLPMs are typically fed a combination of multiple public datasets to meet the demand for large-scale training data. However, because these datasets differ in size, task type and quality, training a model on their mixture can be problematic. In this work, we introduce a large-scale multi-modal corpus named WuDaoMM, containing more than 650 million image-text pairs in total. Specifically, about 600 million pairs are collected from multiple webpages in which the image and caption are weakly correlated, and the other 50 million strongly correlated image-text pairs are collected from high-quality graphic websites. We also release a base version of WuDaoMM with 5 million strongly correlated image-text pairs, which is sufficient to support common cross-modal model pre-training. In addition, we trained both an understanding and a generation vision-language (VL) model to test the effectiveness of the dataset. The results show that WuDaoMM can be applied as an efficient dataset for VLPMs, especially for models in the text-to-image generation task. The data is released at https://data.wudaoai.cn
