Generative models have been widely studied in computer vision. Recently, diffusion models have drawn substantial attention due to the high quality of their generated images. A key desired property of image generative models is the ability to disentangle different attributes: modifying an image towards a target style should not change its semantic content, and the modification parameters should generalize to different images. Previous studies have found that generative adversarial networks (GANs) are inherently endowed with such disentanglement capability, so they can perform disentangled image editing without re-training or fine-tuning the network. In this work, we explore whether diffusion models are also inherently equipped with such a capability. Our finding is that for stable diffusion models, by partially changing the input text embedding from a neutral description (e.g., "a photo of person") to one with style (e.g., "a photo of person with smile") while fixing all the Gaussian random noises introduced during the denoising process, the generated images can be modified towards the target style without changing the semantic content. Based on this finding, we further propose a simple, lightweight image editing algorithm in which the mixing weights of the two text embeddings are optimized for style matching and content preservation. The entire process only involves optimizing around 50 parameters and does not fine-tune the diffusion model itself. Experiments show that the proposed method can modify a wide range of attributes, outperforming diffusion-model-based image-editing algorithms that require fine-tuning, and that the optimized weights generalize well to different images. Our code is publicly available at https://github.com/UCSB-NLP-Chang/DiffusionDisentanglement.
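To make the embedding-mixing mechanism concrete, the following is a minimal sketch using the Hugging Face diffusers library. The checkpoint choice, the sigmoid parameterization of the per-step weights, and names such as `encode` and `mix_weights` are illustrative assumptions rather than the authors' exact implementation (see the repository above for that); classifier-free guidance is omitted for brevity.

```python
# Sketch: mix a neutral and a styled text embedding with one learnable
# weight per denoising step, while keeping all Gaussian noise fixed.
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5"
).to(device)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

def encode(prompt: str) -> torch.Tensor:
    """Encode a prompt into CLIP text embeddings."""
    ids = pipe.tokenizer(
        prompt, padding="max_length",
        max_length=pipe.tokenizer.model_max_length,
        truncation=True, return_tensors="pt",
    ).input_ids.to(device)
    return pipe.text_encoder(ids)[0]

c_neutral = encode("a photo of person")             # content description
c_style   = encode("a photo of person with smile")  # target style

num_steps = 50
pipe.scheduler.set_timesteps(num_steps)
# One mixing weight per step: the ~50 trainable parameters.
mix_weights = torch.nn.Parameter(torch.zeros(num_steps, device=device))

# Fix ALL randomness: the same seeded initial latent on every pass.
generator = torch.Generator(device=device).manual_seed(0)
latents = torch.randn(
    (1, pipe.unet.config.in_channels, 64, 64),
    generator=generator, device=device,
) * pipe.scheduler.init_noise_sigma

for i, t in enumerate(pipe.scheduler.timesteps):
    lam = torch.sigmoid(mix_weights[i])           # keep weight in [0, 1]
    cond = (1 - lam) * c_neutral + lam * c_style  # partially mixed embedding
    noise_pred = pipe.unet(latents, t, encoder_hidden_states=cond).sample
    latents = pipe.scheduler.step(noise_pred, t, latents).prev_sample

image = pipe.vae.decode(latents / pipe.vae.config.scaling_factor).sample
```

This sketch only shows the forward pass with the mixed conditioning; in the full method, `mix_weights` would be optimized against a style-matching objective (e.g., a CLIP-based loss) combined with a content-preservation penalty, with the diffusion model itself kept frozen throughout.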