Diffusion models have achieved great success in synthesizing diverse, high-fidelity images. However, sampling speed and memory consumption remain major barriers to their practical adoption, since generation requires iterative noise estimation with compute-intensive neural networks. We propose to accelerate generation by compressing the noise estimation network through post-training quantization (PTQ). Existing PTQ approaches cannot effectively handle the output distributions of the noise estimation network, which change across the multiple time steps of the diffusion process; we therefore design a PTQ method tailored to this multi-timestep structure, built around a calibration scheme that uses data sampled from different time steps. Experimental results show that our method directly quantizes full-precision diffusion models to 8-bit or 4-bit models while maintaining comparable performance in a training-free manner, with an FID change of at most 1.88. Our approach also applies to text-guided image generation: for the first time, we run Stable Diffusion with 4-bit weights without losing much perceptual quality, as shown in Figure 5 and Figure 9.
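The idea of calibrating quantization parameters on data drawn from different time steps can be illustrated with a minimal sketch. This is not the paper's exact algorithm; the `fake_activations` helper, the uniform affine quantizer, and the chosen time steps are all illustrative assumptions standing in for the real noise estimation network and calibration procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_activations(t, n=1024):
    # Hypothetical stand-in for noise-estimation-network activations at
    # time step t; the spread varies with t to mimic how output
    # distributions shift across the diffusion process.
    return rng.normal(scale=1.0 + 0.5 * t / 100.0, size=n)

def calibrate_uniform(samples, n_bits=8):
    # Fit uniform affine quantization parameters (scale, zero point)
    # from the calibration samples' observed range.
    lo, hi = samples.min(), samples.max()
    qmax = 2**n_bits - 1
    scale = (hi - lo) / qmax
    zero = np.round(-lo / scale)
    return scale, zero

def quantize(x, scale, zero, n_bits=8):
    # Quantize to integers, then dequantize back to floats.
    q = np.clip(np.round(x / scale + zero), 0, 2**n_bits - 1)
    return (q - zero) * scale

# Draw the calibration set uniformly from several time steps, rather
# than calibrating on activations from a single step only.
steps = [0, 25, 50, 75, 100]
calib = np.concatenate([fake_activations(t) for t in steps])
scale, zero = calibrate_uniform(calib, n_bits=8)

# Reconstruction error on in-range data is bounded by about one
# quantization step of size `scale`.
err = np.abs(quantize(calib, scale, zero) - calib).max()
```

Calibrating on a single time step would fit the quantization range to that step's distribution alone, so activations at other steps could fall outside the range and be clipped; pooling samples across steps covers the full multi-timestep range.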