How do diffusion generative models convert pure noise into meaningful images? We argue that generation first commits to an outline and then fills in progressively finer details. The corresponding reverse diffusion process can be modeled by dynamics on a (time-dependent) high-dimensional landscape full of Gaussian-like modes, which makes the following predictions: (i) individual trajectories tend to be very low-dimensional; (ii) scene elements that vary more within the training data tend to emerge earlier; and (iii) early perturbations change image content substantially more often than late perturbations. We show that the behavior of a variety of trained unconditional and conditional diffusion models, such as Stable Diffusion, is consistent with these predictions. Finally, we use our theory to search for the latent image manifold of diffusion models, and propose a new way to generate interpretable image variations. Our viewpoint suggests that generation by GANs and by diffusion models has unexpected similarities.
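To make prediction (i) concrete, the sketch below (not the paper's code; all dimensions, schedules, and parameter values are illustrative assumptions) simulates exact reverse diffusion on a Gaussian-mixture landscape. Because the score of a Gaussian mixture convolved with Gaussian noise is available in closed form, no trained network is needed; PCA of the resulting trajectory then shows how much of its variance is confined to a few directions.

```python
# Minimal sketch (assumed setup, not the paper's experiment): reverse
# diffusion via the probability-flow ODE on a Gaussian-mixture landscape,
# then PCA of the trajectory to estimate its effective dimension.
import numpy as np

rng = np.random.default_rng(0)
D, K = 512, 8                        # ambient dimension, number of modes (illustrative)
mus = rng.normal(0.0, 1.0, (K, D))   # hypothetical mode centers
s2 = 0.01                            # per-mode isotropic variance

def score(x, sigma2):
    """Exact score of the mixture after adding N(0, sigma2 I) noise."""
    var = s2 + sigma2
    # log-responsibilities of each mode, computed stably (equal weights)
    logw = -0.5 * np.sum((x - mus) ** 2, axis=1) / var
    logw -= logw.max()
    w = np.exp(logw)
    w /= w.sum()
    return (w @ mus - x) / var       # sum_k gamma_k (mu_k - x) / var

# Variance-exploding schedule sigma(t) = sigma_max * t, Euler integration
# of the probability-flow ODE dx/dt = -sigma'(t) sigma(t) * score(x, t)
# backwards from t = 1 (pure noise) to t ~ 0 (a mode).
sigma_max, steps = 20.0, 1000
ts = np.linspace(1.0, 1e-3, steps)
x = rng.normal(0.0, sigma_max, D)
traj = [x.copy()]
for i in range(steps - 1):
    t, dt = ts[i], ts[i] - ts[i + 1]
    sigma, sigma_dot = sigma_max * t, sigma_max
    x += sigma_dot * sigma * score(x, sigma ** 2) * dt
    traj.append(x.copy())

# PCA of the trajectory: fraction of variance in the top few directions.
T = np.array(traj)
Tc = T - T.mean(axis=0)
svals = np.linalg.svd(Tc, compute_uv=False)
ratio = (svals ** 2) / np.sum(svals ** 2)
print("variance in top 3 PCA directions:", ratio[:3].sum())
```

The printed ratio quantifies how close the path through the 512-dimensional space stays to a low-dimensional subspace, which is the sense in which prediction (i) can be tested on trained models as well.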