Recent advances in vision-language pre-training have pushed the state-of-the-art on various vision-language tasks, making machines more capable of multi-modal writing (image-to-text generation) and painting (text-to-image generation). However, few studies have investigated whether these two essential capabilities can be learned together and boost each other, yielding a versatile and powerful multi-modal foundation model. In this work, we disclose the potential of symmetric generative vision-language pre-training for learning to write and paint concurrently, and propose a new unified modal model, named DaVinci, trained with prefix language modeling and prefix image modeling, two simple generative self-supervised objectives on image-text pairs. Thanks to the proposed prefix multi-modal modeling framework, DaVinci is simple to train, scalable to huge data, adaptable to both writing and painting tasks, and also strong on other vision, text, and multi-modal understanding tasks. DaVinci achieves competitive performance on a wide range of 27 generation/understanding tasks and demonstrates the superiority of combining generative vision and language pre-training. Furthermore, we carefully benchmark the performance of different vision-language pre-training objectives on pre-training datasets of different scales with heterogeneous and broad distribution coverage. Our results demonstrate the potential of exploiting self-supervision in both language and vision inputs, and establish new, stronger baselines for future comparisons at different data scales. The code and pre-trained models are available at https://github.com/shizhediao/DaVinci.
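As a rough illustrative sketch (not the authors' implementation), both prefix objectives can be viewed as a next-token cross-entropy loss in which the prefix serves as given context and only suffix positions are supervised. The `model`, `tokens`, and `prefix_len` interfaces below are assumptions for illustration; DaVinci's actual architecture and tokenization differ in detail:

```python
import torch
import torch.nn.functional as F

def prefix_modeling_loss(model, tokens, prefix_len):
    """Generic prefix-modeling objective (hypothetical interface).

    Tokens before `prefix_len` act as conditioning context; the model
    is trained to predict only the remaining (suffix) tokens.
    `model` maps token ids of shape (batch, seq) to logits of shape
    (batch, seq, vocab).
    """
    logits = model(tokens)
    # Shift so that position t predicts token t+1.
    shift_logits = logits[:, :-1].contiguous()
    shift_labels = tokens[:, 1:].clone()
    # Mask out predictions that fall inside the prefix: they are
    # context, not targets.
    shift_labels[:, : prefix_len - 1] = -100
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )
```

Under this view, writing (image-to-text) uses image tokens as the prefix and text tokens as the suffix, while painting (text-to-image) reverses the roles; the abstract does not specify how images are discretized, so a discrete image tokenizer is assumed here.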