Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned, high-resolution image synthesis. Starting from random noise, such text-to-image diffusion models gradually synthesize images in an iterative fashion while conditioning on text prompts. We find that their synthesis behavior qualitatively changes throughout this process: early in sampling, generation strongly relies on the text prompt to produce text-aligned content, while later the text conditioning is almost entirely ignored. This suggests that sharing model parameters throughout the entire generation process may not be ideal. Therefore, in contrast to existing works, we propose to train an ensemble of text-to-image diffusion models specialized for different synthesis stages. To maintain training efficiency, we first train a single model and then split it into specialized models that are further trained for specific stages of the iterative generation process. Our ensemble of diffusion models, called eDiffi, achieves improved text alignment while maintaining the same inference computation cost and preserving high visual quality, outperforming previous large-scale text-to-image diffusion models on the standard benchmark. In addition, we train our model to exploit a variety of embeddings for conditioning, including T5 text, CLIP text, and CLIP image embeddings. We show that these different embeddings lead to different behaviors. Notably, the CLIP image embedding provides an intuitive way to transfer the style of a reference image to the target text-to-image output. Lastly, we present a technique that enables eDiffi's "paint-with-words" capability: a user can select words in the input text and paint them on a canvas to control the output, which is handy for crafting the image they have in mind. The project page is available at https://deepimagination.cc/eDiffi/
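To make the ensemble idea concrete, below is a minimal toy sketch (not the authors' code) of routing each denoising step to a stage-specialized expert during DDPM-style ancestral sampling. The `ToyDenoiser`, `pick_expert`, and `sample` names, the three-expert split, and the linear beta schedule are all illustrative assumptions, not the eDiffi architecture or training recipe.

```python
# Hypothetical sketch: stage-specialized "expert" denoisers selected by
# timestep interval during DDPM-style ancestral sampling.
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Toy epsilon-prediction network standing in for one expert of the ensemble."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))

    def forward(self, x, t):
        # Condition on the (normalized) timestep by simple concatenation.
        t_feat = t.float().view(-1, 1) / 1000.0
        return self.net(torch.cat([x, t_feat.expand(x.shape[0], 1)], dim=-1))

def pick_expert(experts, t, num_steps):
    """Route to an expert based on which stage of sampling step t falls in:
    early (high-noise) steps -> expert 0, late (low-noise) steps -> last expert."""
    stage = min(int((num_steps - 1 - t) * len(experts) / num_steps), len(experts) - 1)
    return experts[stage]

@torch.no_grad()
def sample(experts, dim=16, num_steps=1000):
    # Standard linear beta schedule for a toy DDPM sampler.
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(1, dim)  # start from pure noise
    for t in reversed(range(num_steps)):
        model = pick_expert(experts, t, num_steps)  # stage-dependent expert
        eps = model(x, torch.tensor([t]))           # predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x

experts = [ToyDenoiser(16) for _ in range(3)]  # e.g. three specialized stages
print(sample(experts).shape)
```

The key design point the sketch illustrates is that inference cost is unchanged: only one expert is evaluated per step, so specializing experts by noise level adds capacity without adding per-step compute.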