In this work, we construct the largest dataset for multimodal pretraining in Chinese, consisting of over 1.9TB of images and 292GB of text covering a wide range of domains. We propose a cross-modal pretraining method called M6, which stands for Multi-Modality to Multi-Modality Multitask Mega-transformer, for unified pretraining on both single-modality and multi-modality data. We scale the model up to 10 billion and 100 billion parameters, building the largest pretrained model in Chinese. We apply the model to a series of downstream applications and demonstrate its outstanding performance in comparison with strong baselines. Furthermore, we specifically design a downstream task of text-guided image generation, and show that the finetuned M6 can create high-quality, high-resolution images with abundant details.