Diffusion-based generative models have demonstrated a capacity for perceptually impressive synthesis, but can they also be great likelihood-based models? We answer this in the affirmative, and introduce a family of diffusion-based generative models that obtain state-of-the-art likelihoods on standard image density estimation benchmarks. Unlike other diffusion-based models, our method allows for efficient optimization of the noise schedule jointly with the rest of the model. We show that the variational lower bound (VLB) simplifies to a remarkably short expression in terms of the signal-to-noise ratio of the diffused data, thereby improving our theoretical understanding of this model class. Using this insight, we prove an equivalence between several models proposed in the literature. In addition, we show that the continuous-time VLB is invariant to the noise schedule, except for the signal-to-noise ratio at its endpoints. This enables us to learn a noise schedule that minimizes the variance of the resulting VLB estimator, leading to faster optimization. Combining these advances with architectural improvements, we obtain state-of-the-art likelihoods on image density estimation benchmarks, outperforming autoregressive models that have dominated these benchmarks for many years, often with significantly faster optimization. In addition, we show how to use the model as part of a bits-back compression scheme, and demonstrate lossless compression rates close to the theoretical optimum. Code is available at https://github.com/google-research/vdm .
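The "remarkably short expression" for the VLB can be sketched as follows. This is a hedged reconstruction in the paper's notation, not a quotation from the abstract: we assume the forward process $z_t = \alpha_t x + \sigma_t \epsilon$ with $\mathrm{SNR}(t) = \alpha_t^2 / \sigma_t^2$, and the exact constants and signs follow the paper's derivation rather than anything stated above.

```latex
% Forward process (assumed notation): z_t = \alpha_t x + \sigma_t \epsilon,
% with signal-to-noise ratio SNR(t) = \alpha_t^2 / \sigma_t^2.
% Continuous-time diffusion loss, up to the prior and reconstruction terms
% at the endpoints. SNR(t) is monotonically decreasing in t, so SNR'(t) < 0
% and the leading minus sign makes the loss non-negative:
\mathcal{L}_\infty(x) \;=\; -\tfrac{1}{2}\,
  \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}
  \int_0^1 \mathrm{SNR}'(t)\,
  \bigl\| x - \hat{x}_\theta(z_t; t) \bigr\|_2^2 \, dt
```

This form makes the invariance claim concrete: changing variables to $v = \mathrm{SNR}(t)$ turns the integral into one over $[\mathrm{SNR}(1), \mathrm{SNR}(0)]$, so the loss depends on the noise schedule only through its endpoint SNR values. The schedule between the endpoints is therefore free to be learned for a different objective, namely reducing the variance of the Monte Carlo estimate of this integral.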