Diffusion generative models have recently greatly improved the power of text-conditioned image generation. Existing image generation models mainly include text-conditional diffusion models and cross-modal guided diffusion models, which are good at simple scene image generation and complex scene image generation, respectively. In this work, we propose a simple yet effective approach, namely UPainting, to unify simple and complex scene image generation, as shown in Figure 1. Based on architecture improvements and diverse guidance schedules, UPainting effectively integrates cross-modal guidance from a pretrained image-text matching model into a text-conditional diffusion model that utilizes a pretrained Transformer language model as the text encoder. Our key finding is that combining the power of a large-scale Transformer language model in understanding language with an image-text matching model in capturing cross-modal semantics and style is effective for improving the sample fidelity and image-text alignment of image generation. In this way, UPainting has a more general image generation capability and can generate images of both simple and complex scenes more effectively. To comprehensively compare text-to-image models, we further create a more general benchmark, UniBench, with well-written Chinese and English prompts for both simple and complex scenes. We compare UPainting with recent models and find that UPainting greatly outperforms them in terms of caption similarity and image fidelity in both simple and complex scenes.
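To make the abstract's core idea concrete, the sketch below shows one way to combine classifier-free guidance from a text-conditional diffusion model with gradient-based guidance from an image-text matching model during sampling. This is a minimal illustration under assumed interfaces, not the paper's actual implementation: the `denoiser`, `match_model`, and the scale and noise-level parameters are hypothetical stand-ins.

```python
# Minimal sketch: combining text-conditional (classifier-free) guidance with
# cross-modal image-text matching guidance in one denoising step.
# All model interfaces here are hypothetical, not the UPainting implementation.
import torch

def guided_noise_prediction(denoiser, match_model, x_t, t, text_emb,
                            cfg_scale=7.5, match_scale=2.0, sigma_t=1.0):
    """Return a guided noise estimate for one sampling step.

    denoiser(x_t, t, cond)          -> predicted noise (cond=None means unconditional)
    match_model(x0_hat, text_emb)   -> per-sample image-text matching score
    """
    # Classifier-free guidance from the text-conditional diffusion model.
    eps_cond = denoiser(x_t, t, text_emb)
    eps_uncond = denoiser(x_t, t, None)
    eps = eps_uncond + cfg_scale * (eps_cond - eps_uncond)

    # Cross-modal guidance: follow the gradient of the image-text matching
    # score with respect to the noisy sample to improve text alignment.
    x_in = x_t.detach().requires_grad_(True)
    x0_hat = x_in - sigma_t * denoiser(x_in, t, text_emb)  # rough clean-image estimate
    score = match_model(x0_hat, text_emb).sum()
    grad = torch.autograd.grad(score, x_in)[0]

    # Shift the noise prediction using the matching gradient.
    return eps - match_scale * sigma_t * grad
```

In practice, the relative weights of the two guidance signals (here `cfg_scale` and `match_scale`) and how they are scheduled over timesteps are the kinds of guidance-schedule choices the abstract refers to.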