Classifier-free guided diffusion models have recently been shown to be highly effective at high-resolution image generation, and they have been widely used in large-scale diffusion frameworks including DALL·E 2, Stable Diffusion and Imagen. However, a downside of classifier-free guided diffusion models is that they are computationally expensive at inference time, since they require evaluating two diffusion models, a class-conditional model and an unconditional model, tens to hundreds of times. To address this limitation, we propose an approach to distilling classifier-free guided diffusion models into models that are fast to sample from: Given a pre-trained classifier-free guided model, we first learn a single model to match the output of the combined conditional and unconditional models, and then we progressively distill that model into a diffusion model that requires far fewer sampling steps. For standard diffusion models trained in pixel space, our approach is able to generate images visually comparable to those of the original model using as few as 4 sampling steps on ImageNet 64x64 and CIFAR-10, achieving FID/IS scores comparable to those of the original model while being up to 256 times faster to sample from. For diffusion models trained in latent space (e.g., Stable Diffusion), our approach is able to generate high-fidelity images using as few as 1 to 4 denoising steps, accelerating inference by at least 10-fold compared to existing methods on ImageNet 256x256 and LAION datasets. We further demonstrate the effectiveness of our approach on text-guided image editing and inpainting, where our distilled model is able to generate high-quality results using as few as 2 to 4 denoising steps.
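To make the stage-one objective concrete, the minimal PyTorch sketch below regresses a single student network onto the classifier-free-guided combination of the teacher's conditional and unconditional predictions. It assumes an epsilon-prediction parameterization for simplicity; all names and signatures here (eps_teacher, eps_student, guided_teacher_eps) are hypothetical placeholders for illustration, not the paper's actual code.

```python
import torch

def guided_teacher_eps(eps_teacher, z_t, t, c, w):
    """Classifier-free guidance: combine the frozen teacher's conditional and
    unconditional noise predictions as (1 + w) * cond - w * uncond."""
    eps_cond = eps_teacher(z_t, t, c)
    eps_uncond = eps_teacher(z_t, t, None)  # None selects the unconditional branch
    return (1 + w) * eps_cond - w * eps_uncond

def stage_one_loss(eps_student, eps_teacher, z_t, t, c, w):
    """Stage one: regress one student model onto the guided teacher output.
    The student is also conditioned on the guidance strength w, so a single
    model covers a range of guidance strengths."""
    with torch.no_grad():  # the teacher is frozen during distillation
        target = guided_teacher_eps(eps_teacher, z_t, t, c, w)
    pred = eps_student(z_t, t, c, w)
    return torch.mean((pred - target) ** 2)
```

Stage two then applies progressive distillation to this single student, repeatedly halving the number of sampling steps it needs while preserving sample quality.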