Diffusion models have shown incredible capabilities as generative models; indeed, they power the current state-of-the-art models on text-conditioned image generation such as Imagen and DALL-E 2. In this work we review, demystify, and unify the understanding of diffusion models across both variational and score-based perspectives. We first derive Variational Diffusion Models (VDM) as a special case of a Markovian Hierarchical Variational Autoencoder, where three key assumptions enable tractable computation and scalable optimization of the ELBO. We then prove that optimizing a VDM boils down to learning a neural network to predict one of three potential objectives: the original source input from any arbitrary noisification of it, the original source noise from any arbitrarily noisified input, or the score function of a noisified input at any arbitrary noise level. We then dive deeper into what it means to learn the score function, and connect the variational perspective of a diffusion model explicitly with the Score-based Generative Modeling perspective through Tweedie's Formula. Lastly, we cover how to learn a conditional distribution using diffusion models via guidance.
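To make the three objectives concrete, the following is a minimal sketch under the standard VDM forward process, with notation assumed from the usual setup rather than defined in the abstract itself: $x_0$ is the source data, $\epsilon \sim \mathcal{N}(0, I)$ is the injected noise, and $\bar\alpha_t$ is the cumulative noise schedule, so that a noisified input is $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$. The three prediction targets are
\[
\hat{x}_\theta(x_t, t) \approx x_0,
\qquad
\hat{\epsilon}_\theta(x_t, t) \approx \epsilon,
\qquad
s_\theta(x_t, t) \approx \nabla_{x_t} \log p(x_t),
\]
and they are interchangeable: applying Tweedie's formula, $\mathbb{E}\!\left[\sqrt{\bar\alpha_t}\,x_0 \mid x_t\right] = x_t + (1-\bar\alpha_t)\,\nabla_{x_t}\log p(x_t)$, yields
\[
\hat{x}_\theta = \frac{x_t - \sqrt{1-\bar\alpha_t}\,\hat{\epsilon}_\theta}{\sqrt{\bar\alpha_t}},
\qquad
s_\theta = -\frac{\hat{\epsilon}_\theta}{\sqrt{1-\bar\alpha_t}}.
\]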
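As a brief illustration of guidance (a sketch in standard notation, not a full treatment): classifier guidance augments the unconditional score with the gradient of an auxiliary classifier on the conditioning signal $y$, while classifier-free guidance interpolates between conditional and unconditional noise predictions; here $\gamma$ and $w$ denote guidance scales and $\varnothing$ the null condition.
\[
\nabla_{x_t}\log p(x_t \mid y) \;=\; \nabla_{x_t}\log p(x_t) \;+\; \gamma\,\nabla_{x_t}\log p(y \mid x_t),
\]
\[
\hat{\epsilon}_\theta(x_t, y) \;\leftarrow\; \hat{\epsilon}_\theta(x_t, \varnothing) \;+\; w\big(\hat{\epsilon}_\theta(x_t, y) - \hat{\epsilon}_\theta(x_t, \varnothing)\big).
\]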