We present the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis involving complex compositions and world knowledge. Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language. This strategy can naturally tap into the rich body of prior work on large language models, which have seen continued advances in capabilities and performance through scaling data and model sizes. Our approach is simple: First, Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens. Second, we achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters, with a new state-of-the-art zero-shot FID score of 7.23 and finetuned FID score of 3.22 on MS-COCO. Our detailed analysis on Localized Narratives as well as PartiPrompts (P2), a new holistic benchmark of over 1600 English prompts, demonstrates the effectiveness of Parti across a wide variety of categories and difficulty aspects. We also explore and highlight limitations of our models in order to define and exemplify key areas of focus for further improvements. See https://parti.research.google/ for high-resolution images.
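The two-stage framing in the abstract (tokenize images into discrete ids, then treat generation as sequence-to-sequence prediction) can be illustrated with a toy sketch. This is not Parti's implementation: ViT-VQGAN learns its encoder and codebook, whereas the sketch below assumes a tiny hand-written codebook and pre-computed patch vectors. The function names, the codebook values, and the `bos_id` start token are all hypothetical choices made for illustration.

```python
# Toy sketch of the two-stage idea: (1) quantize image patch vectors into
# discrete token ids, (2) pair them with text tokens as a seq2seq example.
# All names, sizes, and values are illustrative assumptions, not Parti's.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(patch_vectors, codebook):
    """Map each patch vector to the index of its nearest codebook entry."""
    tokens = []
    for vec in patch_vectors:
        distances = [squared_distance(vec, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens

def make_seq2seq_example(text_tokens, image_tokens, bos_id):
    """Build teacher-forcing inputs and targets for an encoder-decoder model."""
    return {
        "encoder_input": list(text_tokens),
        # The decoder sees the image tokens shifted right by one position.
        "decoder_input": [bos_id] + image_tokens[:-1],
        "decoder_target": list(image_tokens),
    }

# Hypothetical 4-entry codebook and four 2-dimensional patch vectors.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
patches = [[0.1, 0.2], [0.9, 0.1], [0.2, 0.8], [0.95, 0.9]]

image_tokens = quantize(patches, codebook)
# bos_id=4 is an assumed start-of-sequence id outside the codebook range.
example = make_seq2seq_example([12, 7, 42], image_tokens, bos_id=4)
print(example)
```

In this framing the text tokens play the role of the source sentence and the image token ids play the role of the target sentence, so a standard encoder-decoder Transformer can be trained with next-token prediction. A separate detokenizer would then map predicted ids back to pixels.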