Recent advances in text-to-image generation with diffusion models offer transformative capabilities in image quality. However, user controllability over the generated image and fast adaptation to new tasks remain open challenges, currently addressed mostly by costly and lengthy re-training and fine-tuning, or by ad-hoc adaptations to specific image generation tasks. In this work, we present MultiDiffusion, a unified framework that enables versatile and controllable image generation using a pre-trained text-to-image diffusion model, without any further training or fine-tuning. At the center of our approach is a new generation process, based on an optimization task that binds together multiple diffusion generation processes with a shared set of parameters or constraints. We show that MultiDiffusion can be readily applied to generate high-quality and diverse images that adhere to user-provided controls, such as a desired aspect ratio (e.g., panorama) and spatial guiding signals, ranging from tight segmentation masks to bounding boxes. Project webpage: https://multidiffusion.github.io
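To make the idea of binding multiple diffusion processes concrete, the following is a minimal sketch of the panorama setting: a wide latent is covered by overlapping crops, each crop is denoised independently by the pre-trained model, and the per-crop predictions are averaged wherever crops overlap. The names `multidiffusion_step`, `denoise_fn`, and `fake_denoise`, as well as the window/stride values and the toy sampling loop, are illustrative assumptions and not the paper's implementation; in practice `denoise_fn` would wrap a text-conditioned diffusion model step (e.g., a Stable Diffusion UNet plus its scheduler update).

```python
import torch


def multidiffusion_step(latent, t, denoise_fn, window=64, stride=48):
    """One fused denoising step over a wide (panorama-shaped) latent.

    Each crop is denoised independently with the pre-trained model
    (denoise_fn), and the per-crop predictions are averaged wherever
    crops overlap, fusing the separate diffusion paths into one image.
    """
    _, _, _, w = latent.shape
    # Sliding-window start positions, forcing the last crop to reach the right edge.
    starts = list(range(0, max(w - window, 0) + 1, stride))
    if starts[-1] != max(w - window, 0):
        starts.append(max(w - window, 0))

    accum = torch.zeros_like(latent)   # sum of per-crop predictions
    count = torch.zeros_like(latent)   # number of crops covering each location
    for x0 in starts:
        x1 = min(x0 + window, w)
        pred = denoise_fn(latent[:, :, :, x0:x1], t)  # ordinary diffusion step on the crop
        accum[:, :, :, x0:x1] += pred
        count[:, :, :, x0:x1] += 1.0
    return accum / count.clamp(min=1.0)


if __name__ == "__main__":
    # Hypothetical stand-in for a pre-trained, text-conditioned denoising step.
    def fake_denoise(x, t):
        return 0.9 * x

    latent = torch.randn(1, 4, 64, 192)   # wide latent -> panoramic aspect ratio
    for t in reversed(range(10)):          # simplified sampling loop
        latent = multidiffusion_step(latent, t, fake_denoise)
    print(latent.shape)                    # torch.Size([1, 4, 64, 192])
```

The same fusion pattern extends to region-based controls: instead of rectangular crops, each diffusion path can be restricted to a user-provided mask or bounding box, with the per-pixel averaging again reconciling the overlapping predictions.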