Denoising Diffusion Models (DDMs) have emerged as a strong competitor to Generative Adversarial Networks (GANs). However, despite their widespread use in image synthesis and editing applications, their latent space remains less well understood. Recently, a semantic latent space for DDMs, coined `$h$-space', was shown to facilitate semantic image editing in a way reminiscent of GANs. The $h$-space consists of the bottleneck activations in the DDM's denoiser across all timesteps of the diffusion process. In this paper, we explore the properties of $h$-space and propose several novel methods for finding meaningful semantic directions within it. We start by studying unsupervised methods for revealing interpretable semantic directions in pretrained DDMs. Specifically, we show that global latent directions emerge as the principal components in the latent space. Additionally, we provide a novel method for discovering image-specific semantic directions by spectral analysis of the Jacobian of the denoiser w.r.t. the latent code. Next, we extend the analysis by finding directions in a supervised fashion in unconditional DDMs. We demonstrate how such directions can be found by relying either on a labeled data set of real images or on generated samples annotated with a domain-specific attribute classifier. We further show how to semantically disentangle the directions found by simple linear projection. Our approaches require no architectural modifications, text-based guidance, CLIP-based optimization, or model fine-tuning.
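As a rough illustration of three of the operations named above (principal components as global directions, a supervised direction from labeled samples, and disentanglement by linear projection), the following NumPy sketch may help. It is not the paper's implementation. It assumes that each row of `H` is a flattened $h$-space activation vector for one sample, and it assumes that a supervised direction can be taken as the difference between the mean latent codes of the positive and negative classes. All function names are hypothetical.

```python
import numpy as np


def pca_directions(H, k):
    """Return the top-k principal directions of latent samples H (n_samples, dim)."""
    centered = H - H.mean(axis=0, keepdims=True)
    # Rows of Vt are orthonormal principal axes, sorted by decreasing variance.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[:k]


def mean_difference_direction(H, labels):
    """Unit vector pointing from the negative-class mean to the positive-class mean.

    Assumption: a class-mean difference stands in for the supervised direction.
    """
    labels = np.asarray(labels, dtype=bool)
    d = H[labels].mean(axis=0) - H[~labels].mean(axis=0)
    return d / np.linalg.norm(d)


def project_out(d, u):
    """Remove from direction d its component along direction u, then renormalise."""
    u = u / np.linalg.norm(u)
    r = d - np.dot(d, u) * u
    return r / np.linalg.norm(r)
```

In this sketch, an edit would move a latent code along one of the returned unit vectors. `project_out(d, u)` would be used when a direction `d` found for one attribute also changes a second attribute whose direction is `u`.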