By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days, and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes, and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs. Code is available at https://github.com/CompVis/latent-diffusion.
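The following is a minimal, self-contained sketch (not the authors' implementation) of the two ideas the abstract highlights: the denoising objective is applied to the latents of a frozen pretrained autoencoder rather than to pixels, and conditioning information enters the denoiser through cross-attention. All module sizes, the noise schedule, and the toy encoder below are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyDenoiser(nn.Module):
    """Stand-in for the UNet denoiser: a conv trunk plus one cross-attention
    layer whose keys/values come from the conditioning sequence (e.g. text
    embeddings). Timestep embedding is omitted for brevity."""

    def __init__(self, z_ch=4, width=64, cond_dim=32):
        super().__init__()
        self.inp = nn.Conv2d(z_ch, width, 3, padding=1)
        self.attn = nn.MultiheadAttention(width, num_heads=4, batch_first=True)
        self.to_kv = nn.Linear(cond_dim, width)
        self.out = nn.Conv2d(width, z_ch, 3, padding=1)

    def forward(self, z_t, t, cond):
        h = F.silu(self.inp(z_t))
        b, c, hh, ww = h.shape
        q = h.flatten(2).transpose(1, 2)        # (B, HW, C): queries from latents
        kv = self.to_kv(cond)                   # (B, L, C): keys/values from conditioning
        h = h + self.attn(q, kv, kv)[0].transpose(1, 2).reshape(b, c, hh, ww)
        return self.out(h)


def diffusion_loss(encoder, denoiser, image, cond, alpha_bar):
    """Standard noise-prediction loss, computed in latent space instead of pixel space."""
    with torch.no_grad():
        z = encoder(image)                      # frozen pretrained encoder: image -> latent
    t = torch.randint(0, len(alpha_bar), (z.shape[0],), device=z.device)
    a = alpha_bar[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(z)
    z_t = a.sqrt() * z + (1 - a).sqrt() * noise # forward (noising) process on latents
    return F.mse_loss(denoiser(z_t, t, cond), noise)


# Illustrative usage with toy stand-ins for the pretrained autoencoder.
encoder = nn.Conv2d(3, 4, 8, stride=8)          # 256x256x3 image -> 32x32x4 latent
denoiser = TinyDenoiser()
alpha_bar = torch.cumprod(1 - torch.linspace(1e-4, 2e-2, 1000), dim=0)
image = torch.randn(2, 3, 256, 256)
cond = torch.randn(2, 7, 32)                    # e.g. a sequence of text-token embeddings
loss = diffusion_loss(encoder, denoiser, image, cond, alpha_bar)
```

Because the diffusion runs on a spatially compressed latent (here 32x32 rather than 256x256), each denoising step is far cheaper than in pixel space; the decoder of the autoencoder would map generated latents back to images.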