Generative models have been widely studied in computer vision. Recently, diffusion models have drawn substantial attention due to the high quality of their generated images. A key desired property of image generative models is the ability to disentangle different attributes: modifying an image towards a target style should not change its semantic content, and the modification parameters should generalize to different images. Previous studies have found that generative adversarial networks (GANs) are inherently endowed with such disentanglement capability, so they can perform disentangled image editing without re-training or fine-tuning the network. In this work, we explore whether diffusion models are also inherently equipped with such a capability. Our finding is that for stable diffusion models, by partially changing the input text embedding from a neutral description (e.g., "a photo of person") to one with style (e.g., "a photo of person with smile") while fixing all the Gaussian random noises introduced during the denoising process, the generated images can be modified towards the target style without changing the semantic content. Based on this finding, we further propose a simple, lightweight image editing algorithm in which the mixing weights of the two text embeddings are optimized for style matching and content preservation. The entire process only involves optimizing around 50 parameters and does not fine-tune the diffusion model itself. Experiments show that the proposed method can modify a wide range of attributes, outperforming diffusion-model-based image-editing algorithms that require fine-tuning, and that the optimized weights generalize well to different images. Our code is publicly available at https://github.com/UCSB-NLP-Chang/DiffusionDisentanglement.
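To make the embedding-mixing mechanism concrete, the following is a minimal sketch using the Hugging Face diffusers library. The checkpoint choice, the sigmoid parameterization of the per-step weights, and names such as `encode` and `mix_weights` are illustrative assumptions rather than the authors' exact implementation (see the repository above for that); classifier-free guidance is omitted for brevity.

```python
# Sketch: mix a neutral and a styled text embedding with one learnable
# weight per denoising step, while keeping all Gaussian noise fixed.
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5"
).to(device)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

def encode(prompt: str) -> torch.Tensor:
    """Encode a prompt into CLIP text embeddings."""
    ids = pipe.tokenizer(
        prompt, padding="max_length",
        max_length=pipe.tokenizer.model_max_length,
        truncation=True, return_tensors="pt",
    ).input_ids.to(device)
    return pipe.text_encoder(ids)[0]

c_neutral = encode("a photo of person")             # content description
c_style   = encode("a photo of person with smile")  # target style

num_steps = 50
pipe.scheduler.set_timesteps(num_steps)
# One mixing weight per step: the ~50 trainable parameters.
mix_weights = torch.nn.Parameter(torch.zeros(num_steps, device=device))

# Fix ALL randomness: the same seeded initial latent on every pass.
generator = torch.Generator(device=device).manual_seed(0)
latents = torch.randn(
    (1, pipe.unet.config.in_channels, 64, 64),
    generator=generator, device=device,
) * pipe.scheduler.init_noise_sigma

for i, t in enumerate(pipe.scheduler.timesteps):
    lam = torch.sigmoid(mix_weights[i])           # keep weight in [0, 1]
    cond = (1 - lam) * c_neutral + lam * c_style  # partially mixed embedding
    noise_pred = pipe.unet(latents, t, encoder_hidden_states=cond).sample
    latents = pipe.scheduler.step(noise_pred, t, latents).prev_sample

image = pipe.vae.decode(latents / pipe.vae.config.scaling_factor).sample
```

This sketch only shows the forward pass with the mixed conditioning; in the full method, `mix_weights` would be optimized against a style-matching objective (e.g., a CLIP-based loss) combined with a content-preservation penalty, with the diffusion model itself kept frozen throughout.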