Recent advances in vision-language pre-training have pushed the state-of-the-art on various vision-language tasks, making machines more capable of multi-modal writing (image-to-text generation) and painting (text-to-image generation). However, few studies have investigated whether these two essential capabilities can be learned together and boost each other, yielding a versatile and powerful multi-modal foundation model. In this work, we disclose the potential of symmetric generative vision-language pre-training for learning to write and paint concurrently, and propose a new unified modal model, named DaVinci, trained with prefix language modeling and prefix image modeling, two simple generative self-supervised objectives on image-text pairs. Thanks to the proposed prefix multi-modal modeling framework, DaVinci is simple to train, scalable to huge data, adaptable to both writing and painting tasks, and also strong on other vision, text, and multi-modal understanding tasks. DaVinci achieves competitive performance on a wide range of 27 generation/understanding tasks and demonstrates the superiority of combining generative vision and language pre-training. Furthermore, we carefully benchmark the performance of different vision-language pre-training objectives on pre-training datasets of different scales with heterogeneous and broad distribution coverage. Our results demonstrate the potential of exploiting self-supervision in both language and vision inputs, and establish new, stronger baselines for future comparisons at different data scales. The code and pre-trained models are available at https://github.com/shizhediao/DaVinci.
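As a rough illustrative sketch (not the authors' implementation), both prefix objectives can be viewed as a next-token cross-entropy loss in which the prefix serves as given context and only suffix positions are supervised. The `model`, `tokens`, and `prefix_len` interfaces below are assumptions for illustration; DaVinci's actual architecture and tokenization differ in detail:

```python
import torch
import torch.nn.functional as F

def prefix_modeling_loss(model, tokens, prefix_len):
    """Generic prefix-modeling objective (hypothetical interface).

    Tokens before `prefix_len` act as conditioning context; the model
    is trained to predict only the remaining (suffix) tokens.
    `model` maps token ids of shape (batch, seq) to logits of shape
    (batch, seq, vocab).
    """
    logits = model(tokens)
    # Shift so that position t predicts token t+1.
    shift_logits = logits[:, :-1].contiguous()
    shift_labels = tokens[:, 1:].clone()
    # Mask out predictions that fall inside the prefix: they are
    # context, not targets.
    shift_labels[:, : prefix_len - 1] = -100
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )
```

Under this view, writing (image-to-text) uses image tokens as the prefix and text tokens as the suffix, while painting (text-to-image) reverses the roles; the abstract does not specify how images are discretized, so a discrete image tokenizer is assumed here.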