Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis. Starting from random noise, such text-to-image diffusion models gradually synthesize images in an iterative fashion while conditioning on text prompts. We find that their synthesis behavior qualitatively changes throughout this process: early in sampling, generation relies strongly on the text prompt to produce text-aligned content, while later, the text conditioning is almost entirely ignored. This suggests that sharing model parameters throughout the entire generation process may not be ideal. Therefore, in contrast to existing works, we propose to train an ensemble of text-to-image diffusion models specialized for different synthesis stages. To maintain training efficiency, we initially train a single model, which is then split into specialized models that are trained for the specific stages of the iterative generation process. Our ensemble of diffusion models, called eDiff-I, results in improved text alignment while maintaining the same inference computation cost and preserving high visual quality, outperforming previous large-scale text-to-image diffusion models on the standard benchmark. In addition, we train our model to exploit a variety of embeddings for conditioning, including the T5 text, CLIP text, and CLIP image embeddings. We show that these different embeddings lead to different behaviors. Notably, the CLIP image embedding allows an intuitive way of transferring the style of a reference image to the target text-to-image output. Lastly, we show a technique that enables eDiff-I's "paint-with-words" capability. A user can select a word in the input text and paint it on a canvas to control the output, which is very handy for crafting the image they have in mind. The project page is available at https://deepimagination.cc/eDiff-I/
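To make the stage-specialization idea concrete, the following is a minimal sketch of a sampler that routes each denoising step to the expert denoiser responsible for the current noise level. It is an illustration under assumed interfaces, not the authors' implementation: the names `experts`, `noise_schedule`, and the simple Euler-style update are hypothetical placeholders.

```python
# Minimal sketch of sampling with an ensemble of stage-specialized denoisers.
# Assumptions (not from the paper's code): each expert is a triple
# (sigma_low, sigma_high, model) covering one interval of noise levels, and
# model(x, sigma, text_emb) returns a denoised image prediction.

import torch


@torch.no_grad()
def sample_with_experts(experts, noise_schedule, text_emb, shape, device="cpu"):
    """Iterative sampling where the active denoiser depends on the noise level.

    experts:        list of (sigma_low, sigma_high, model) triples that together
                    cover the full noise range used by the schedule.
    noise_schedule: decreasing sequence of noise levels sigma_T > ... > sigma_1.
    """
    # Start from pure Gaussian noise at the highest noise level.
    x = torch.randn(shape, device=device) * noise_schedule[0]

    for i, sigma in enumerate(noise_schedule):
        # Pick the expert whose noise interval contains the current sigma;
        # early (high-noise) steps and late (low-noise) steps use different models.
        model = next(m for lo, hi, m in experts if lo <= sigma <= hi)
        denoised = model(x, sigma, text_emb)

        # Simple Euler-style step toward the next, lower noise level.
        sigma_next = noise_schedule[i + 1] if i + 1 < len(noise_schedule) else 0.0
        x = denoised + (x - denoised) * (sigma_next / sigma)

    return x
```

Because only one expert is evaluated per step, this routing keeps the per-step inference cost identical to a single shared model, which matches the abstract's claim of unchanged inference computation.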