Recently, GAN inversion methods combined with Contrastive Language-Image Pretraining (CLIP) have enabled zero-shot image manipulation guided by text prompts. However, their application to diverse real images is still difficult due to the limited GAN inversion capability. Specifically, these approaches often have difficulty reconstructing images with novel poses, views, and highly variable contents compared to the training data, and tend to alter object identity or produce unwanted image artifacts. To mitigate these problems and enable faithful manipulation of real images, we propose a novel method, dubbed DiffusionCLIP, that performs text-driven image manipulation using diffusion models. Based on the full inversion capability and high-quality image generation power of recent diffusion models, our method performs zero-shot image manipulation successfully even between unseen domains, and takes another step towards general application by manipulating images from the widely varying ImageNet dataset. Furthermore, we propose a novel noise combination method that allows straightforward multi-attribute manipulation. Extensive experiments and human evaluation confirm the robust and superior manipulation performance of our method compared to existing baselines. Code is available at https://github.com/gwang-kim/DiffusionCLIP.git.
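For intuition on the noise combination idea, below is a minimal PyTorch-style sketch that mixes the noise predictions of several attribute-specific fine-tuned diffusion models inside a deterministic DDIM reverse step. The function names, the per-model call signature, and the weighting scheme are illustrative assumptions for exposition, not the repository's exact API.

```python
import torch

def combined_noise(models, weights, x_t, t):
    """Weighted sum of noise predictions from several fine-tuned
    diffusion models (one per target attribute).
    `models`, `weights`, and the epsilon-prediction call signature
    are illustrative placeholders, not the repository's API."""
    return sum(w * m(x_t, t) for m, w in zip(models, weights))

def ddim_step(x_t, eps, alpha_t, alpha_prev):
    """One deterministic DDIM reverse step (eta = 0) driven by the
    combined noise estimate."""
    # Estimate the clean image from the current noisy sample.
    x0_pred = (x_t - (1 - alpha_t).sqrt() * eps) / alpha_t.sqrt()
    # Re-noise the estimate toward the previous timestep.
    return alpha_prev.sqrt() * x0_pred + (1 - alpha_prev).sqrt() * eps
```

Running such a loop over all timesteps with fixed per-attribute weights is one simple way multiple text-driven edits could be applied in a single sampling pass.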