Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, Yonghui Wu

From arXiv, preprint

We present the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. This strategy can naturally tap into the rich body of prior work on large language models, which have seen continued advances in capabilities and performance through scaling data and model sizes. Our approach is simple: First, Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens. Second, we achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters, with a new state-of-the-art zero-shot FID score of 7.23 and finetuned FID score of 3.22 on MS-COCO. Our detailed analysis on Localized Narratives as well as PartiPrompts (P2), a new holistic benchmark of over 1600 English prompts, demonstrates the effectiveness of Parti across a wide variety of categories and difficulty aspects. We also explore and highlight limitations of our models in order to define and exemplify key areas of focus for further improvements. See https://parti.research.google/ for high-resolution images.
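The two steps in the abstract (encode an image as discrete tokens, then train a sequence-to-sequence model with text as source and image tokens as target) can be illustrated with a toy sketch. This is not the Parti implementation: the real tokenizer, ViT-VQGAN, uses a learned Transformer encoder and a learned codebook, whereas the hand-written `codebook`, the 2-D `patches`, and the helper names `nearest_code`, `tokenize_image`, and `make_seq2seq_example` below are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence


def nearest_code(vector: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def tokenize_image(patches: Sequence[Sequence[float]],
                   codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each patch embedding to a discrete token id (vector quantization)."""
    return [nearest_code(patch, codebook) for patch in patches]


def make_seq2seq_example(text_tokens: Sequence[int],
                         image_tokens: Sequence[int],
                         bos_id: int) -> Dict[str, List[int]]:
    """Build one training pair: text in, image tokens out.

    The decoder input is the target shifted right by one position
    (teacher forcing), so position t is predicted from tokens before t.
    """
    image_tokens = list(image_tokens)
    return {
        "encoder_input": list(text_tokens),
        "decoder_input": [bos_id] + image_tokens[:-1],
        "decoder_target": image_tokens,
    }


# Toy data (assumed, not from the paper): 4 codes, 4 patch embeddings in 2-D.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
patches = [[0.1, 0.1], [0.9, 0.2], [0.2, 0.8], [0.95, 0.9]]

image_tokens = tokenize_image(patches, codebook)
# bos_id=4 sits just outside the 4-entry image vocabulary.
example = make_seq2seq_example([5, 6, 7], image_tokens, bos_id=4)
print(image_tokens)
print(example)
```

Framing the target as a token sequence is what lets the image side reuse a standard encoder-decoder language-model training setup; in a real system a separate decoder would map the generated tokens back to pixels.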

