Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis. Starting from random noise, such text-to-image diffusion models gradually synthesize images in an iterative fashion while conditioning on text prompts. We find that their synthesis behavior qualitatively changes throughout this process: early in sampling, generation relies strongly on the text prompt to produce text-aligned content, while later, the text conditioning is almost entirely ignored. This suggests that sharing model parameters throughout the entire generation process may not be ideal. Therefore, in contrast to existing works, we propose to train an ensemble of text-to-image diffusion models specialized for different synthesis stages. To maintain training efficiency, we initially train a single model, which is then split into specialized models that are trained for the specific stages of the iterative generation process. Our ensemble of diffusion models, called eDiff-I, results in improved text alignment while maintaining the same inference computation cost and preserving high visual quality, outperforming previous large-scale text-to-image diffusion models on the standard benchmark. In addition, we train our model to exploit a variety of embeddings for conditioning, including the T5 text, CLIP text, and CLIP image embeddings. We show that these different embeddings lead to different behaviors. Notably, the CLIP image embedding allows an intuitive way of transferring the style of a reference image to the target text-to-image output. Lastly, we show a technique that enables eDiff-I's "paint-with-words" capability. A user can select a word in the input text and paint it on a canvas to control the output, which is very handy for crafting the image they have in mind. The project page is available at https://deepimagination.cc/eDiff-I/
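To make the stage-specialization idea concrete, the following is a minimal sketch of a sampler that routes each denoising step to the expert denoiser responsible for the current noise level. It is an illustration under assumed interfaces, not the authors' implementation: the names `experts`, `noise_schedule`, and the simple Euler-style update are hypothetical placeholders.

```python
# Minimal sketch of sampling with an ensemble of stage-specialized denoisers.
# Assumptions (not from the paper's code): each expert is a triple
# (sigma_low, sigma_high, model) covering one interval of noise levels, and
# model(x, sigma, text_emb) returns a denoised image prediction.

import torch


@torch.no_grad()
def sample_with_experts(experts, noise_schedule, text_emb, shape, device="cpu"):
    """Iterative sampling where the active denoiser depends on the noise level.

    experts:        list of (sigma_low, sigma_high, model) triples that together
                    cover the full noise range used by the schedule.
    noise_schedule: decreasing sequence of noise levels sigma_T > ... > sigma_1.
    """
    # Start from pure Gaussian noise at the highest noise level.
    x = torch.randn(shape, device=device) * noise_schedule[0]

    for i, sigma in enumerate(noise_schedule):
        # Pick the expert whose noise interval contains the current sigma;
        # early (high-noise) steps and late (low-noise) steps use different models.
        model = next(m for lo, hi, m in experts if lo <= sigma <= hi)
        denoised = model(x, sigma, text_emb)

        # Simple Euler-style step toward the next, lower noise level.
        sigma_next = noise_schedule[i + 1] if i + 1 < len(noise_schedule) else 0.0
        x = denoised + (x - denoised) * (sigma_next / sigma)

    return x
```

Because only one expert is evaluated per step, this routing keeps the per-step inference cost identical to a single shared model, which matches the abstract's claim of unchanged inference computation.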