Classifier-free guided diffusion models have recently been shown to be highly effective at high-resolution image generation, and they have been widely used in large-scale diffusion frameworks including DALL·E 2, Stable Diffusion, and Imagen. However, a downside of classifier-free guided diffusion models is that they are computationally expensive at inference time, since they require evaluating two diffusion models, a class-conditional model and an unconditional model, tens to hundreds of times. To deal with this limitation, we propose an approach to distilling classifier-free guided diffusion models into models that are fast to sample from: given a pre-trained classifier-free guided model, we first learn a single model to match the output of the combined conditional and unconditional models, and then we progressively distill that model into a diffusion model that requires far fewer sampling steps. For standard diffusion models trained in pixel space, our approach generates images visually comparable to those of the original model using as few as 4 sampling steps on ImageNet 64x64 and CIFAR-10, achieving FID/IS scores comparable to those of the original model while being up to 256 times faster to sample from. For diffusion models trained in latent space (e.g., Stable Diffusion), our approach generates high-fidelity images using as few as 1 to 4 denoising steps, accelerating inference by at least 10-fold compared to existing methods on ImageNet 256x256 and LAION datasets. We further demonstrate the effectiveness of our approach on text-guided image editing and inpainting, where our distilled model generates high-quality results using as few as 2 to 4 denoising steps.
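To make the first distillation stage concrete, below is a minimal sketch of the stage-one objective: a single student network is trained to regress the classifier-free-guided teacher output $\hat{\epsilon} = (1+w)\,\epsilon_\theta(z_t, c) - w\,\epsilon_\theta(z_t)$. The tiny networks, shapes, and names here (`TinyEpsNet`, `guided_teacher_eps`, `null_class`) are illustrative assumptions rather than the paper's architecture; in the paper the student is additionally conditioned on the guidance weight $w$, and a second stage then applies progressive distillation to repeatedly halve the number of sampling steps.

```python
# Sketch of stage-one distillation: a single student matches the
# classifier-free-guided teacher,
#   eps_guided = (1 + w) * eps(z_t, t, c) - w * eps(z_t, t, null).
# The MLP denoisers and tensor shapes below are hypothetical stand-ins;
# real models are U-Nets over images or latents.
import torch
import torch.nn as nn

class TinyEpsNet(nn.Module):
    """Stand-in noise-prediction network eps(z_t, t, c)."""
    def __init__(self, dim=8, num_classes=10):
        super().__init__()
        # Reserve the last embedding index as the "unconditional" null class.
        self.embed = nn.Embedding(num_classes + 1, dim)
        self.net = nn.Sequential(
            nn.Linear(2 * dim + 1, 64), nn.SiLU(), nn.Linear(64, dim)
        )

    def forward(self, z_t, t, c):
        h = torch.cat([z_t, self.embed(c), t[:, None]], dim=-1)
        return self.net(h)

def guided_teacher_eps(teacher, z_t, t, c, w, null_class):
    """Classifier-free guidance: combine conditional and unconditional passes."""
    eps_cond = teacher(z_t, t, c)
    eps_uncond = teacher(z_t, t, torch.full_like(c, null_class))
    return (1 + w) * eps_cond - w * eps_uncond

# One distillation step: the student regresses the guided teacher output,
# so sampling later needs only a single network evaluation per step.
teacher, student = TinyEpsNet(), TinyEpsNet()
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

z_t = torch.randn(16, 8)            # noisy inputs at timestep t
t = torch.rand(16)                  # continuous timesteps in [0, 1]
c = torch.randint(0, 10, (16,))     # class labels
w = 2.0                             # guidance weight

with torch.no_grad():
    target = guided_teacher_eps(teacher, z_t, t, c, w, null_class=10)

opt.zero_grad()
loss = ((student(z_t, t, c) - target) ** 2).mean()
loss.backward()
opt.step()
```

Stage two (not shown) follows progressive distillation: the stage-one student becomes a teacher, and a new student learns to match two of the teacher's sampling steps with one of its own, halving the step count each round.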