Controllable image synthesis models allow creation of diverse images based on text instructions or guidance from a reference image. Recently, denoising diffusion probabilistic models have been shown to generate more realistic imagery than prior methods, and have been successfully demonstrated in unconditional and class-conditional settings. We investigate fine-grained, continuous control of this model class, and introduce a novel unified framework for semantic diffusion guidance, which allows either language or image guidance, or both. Guidance is injected into a pretrained unconditional diffusion model using the gradient of image-text or image matching scores. We explore CLIP-based language guidance as well as both content and style-based image guidance in a unified framework. Our text-guided synthesis approach can be applied to datasets without associated text annotations. We conduct experiments on FFHQ and LSUN datasets, and show results on fine-grained text-guided image synthesis, synthesis of images related to a style or content reference image, and examples with both textual and image guidance.
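As a rough illustration of the guidance mechanism summarized above, the sketch below shifts the mean of an unconditional reverse-diffusion step along the gradient of a CLIP image-text matching score. The interfaces used here (`diffusion_model.p_mean_variance`, `clip_model.encode_image`, `clip_model.encode_text`) and the guidance scale are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of gradient-based semantic guidance. `diffusion_model` and
# `clip_model` are assumed placeholders: a pretrained unconditional diffusion
# model exposing the posterior mean/log-variance, and a CLIP-like matcher.
import torch
import torch.nn.functional as F

def guided_sample_step(diffusion_model, clip_model, x_t, t, text_tokens, scale=100.0):
    """One reverse-diffusion step with CLIP-based language guidance.

    The unconditional posterior mean is shifted by the gradient of the
    image-text matching score with respect to the noisy sample x_t.
    """
    # Unconditional reverse step: predicted mean and log-variance at step t.
    mean, log_var = diffusion_model.p_mean_variance(x_t, t)

    # Image-text matching score and its gradient w.r.t. the noisy image.
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        img_emb = F.normalize(clip_model.encode_image(x_in), dim=-1)
        txt_emb = F.normalize(clip_model.encode_text(text_tokens), dim=-1)
        score = (img_emb * txt_emb).sum()            # cosine similarity
        grad = torch.autograd.grad(score, x_in)[0]

    # Inject guidance: shift the mean along the gradient, weighted by the
    # predicted variance and a user-chosen guidance scale.
    guided_mean = mean + scale * log_var.exp() * grad

    noise = torch.randn_like(x_t)
    return guided_mean + (0.5 * log_var).exp() * noise
```

Image guidance (content or style references) would follow the same pattern, replacing the image-text score with an image-image matching score; no re-training of the diffusion model is needed in either case.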