Controllable image synthesis models allow creation of diverse images based on text instructions or guidance from a reference image. Recently, denoising diffusion probabilistic models have been shown to generate more realistic imagery than prior methods, and have been successfully demonstrated in unconditional and class-conditional settings. We investigate fine-grained, continuous control of this model class, and introduce a novel unified framework for semantic diffusion guidance, which allows either language or image guidance, or both. Guidance is injected into a pretrained unconditional diffusion model using the gradient of image-text or image matching scores, without re-training the diffusion model. We explore CLIP-based language guidance as well as both content and style-based image guidance in a unified framework. Our text-guided synthesis approach can be applied to datasets without associated text annotations. We conduct experiments on FFHQ and LSUN datasets, and show results on fine-grained text-guided image synthesis, synthesis of images related to a style or content reference image, and examples with both textual and image guidance.
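Below is a minimal sketch, not the authors' released code, of how the language-guidance mechanism described above could be injected into a single reverse-diffusion step. It assumes a frozen, pretrained unconditional DDPM exposing a hypothetical p_mean_variance(x_t, t) interface returning the reverse-step mean and variance, and a CLIP-style model exposing encode_image / encode_text; all helper names and the guidance scale are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def guided_reverse_step(diffusion, clip_model, x_t, t, text_tokens, scale=100.0):
    """One DDPM reverse step with CLIP image-text guidance (sketch, classifier-guidance style)."""
    # Unconditional reverse-step statistics from the frozen diffusion model
    # (hypothetical interface; the diffusion model is never re-trained).
    mean, variance = diffusion.p_mean_variance(x_t, t)

    # Gradient of the image-text matching score with respect to the noisy image x_t.
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        img_emb = F.normalize(clip_model.encode_image(x_in), dim=-1)
        txt_emb = F.normalize(clip_model.encode_text(text_tokens), dim=-1)
        score = (img_emb * txt_emb).sum()          # cosine similarity, summed over the batch
        grad = torch.autograd.grad(score, x_in)[0]

    # Shift the sampling mean toward images that better match the text prompt.
    guided_mean = mean + scale * variance * grad
    noise = torch.randn_like(x_t)
    return guided_mean + variance.sqrt() * noise
```

Image guidance in the same framework would follow the analogous pattern: the text embedding is replaced by (or combined with) an embedding or feature statistics of a reference image, so that the gradient of an image-image matching score steers sampling toward the reference's content or style.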