Recently, GAN inversion methods combined with Contrastive Language-Image Pretraining (CLIP) have enabled zero-shot image manipulation guided by text prompts. However, their application to diverse real images remains difficult due to the limited inversion capability of GANs. Specifically, these approaches often have difficulty reconstructing images whose poses, views, or contents vary substantially from the training data, and they can alter object identity or produce unwanted image artifacts. To mitigate these problems and enable faithful manipulation of real images, we propose a novel method, dubbed DiffusionCLIP, that performs text-driven image manipulation using diffusion models. Leveraging the full inversion capability and high-quality image generation of recent diffusion models, our method performs zero-shot image manipulation successfully even between unseen domains. Furthermore, we propose a novel noise combination method that enables straightforward multi-attribute manipulation. Extensive experiments and human evaluation confirm the robust and superior manipulation performance of our method compared to existing baselines.
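To make the text-driven manipulation concrete, below is a minimal sketch of a directional CLIP loss of the kind such fine-tuning relies on: the edit direction in CLIP image space is pushed toward the direction between a source and a target text prompt. The `clip` calls are the public OpenAI CLIP API; the function name `directional_clip_loss`, the `[-1, 1]` image range, and the surrounding fine-tuning loop are our assumptions for illustration, not the paper's released code.

```python
# Sketch: directional CLIP loss for fine-tuning an image generator
# (here, hypothetically, a diffusion model) toward a text-specified edit.
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model = clip_model.float()
clip_model.requires_grad_(False)  # only the generator is updated

_CLIP_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073])
_CLIP_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711])

def _to_clip_space(x: torch.Tensor) -> torch.Tensor:
    # Map images from the assumed [-1, 1] range to CLIP's 224x224 input.
    x = (x + 1.0) / 2.0
    x = F.interpolate(x, size=224, mode="bicubic", align_corners=False)
    mean = _CLIP_MEAN.to(x).view(1, 3, 1, 1)
    std = _CLIP_STD.to(x).view(1, 3, 1, 1)
    return (x - mean) / std

def directional_clip_loss(original: torch.Tensor,
                          edited: torch.Tensor,
                          source_text: str,
                          target_text: str) -> torch.Tensor:
    """1 - cos( E_I(edited) - E_I(original), E_T(target) - E_T(source) )."""
    tokens = clip.tokenize([source_text, target_text]).to(device)
    with torch.no_grad():
        text_feats = clip_model.encode_text(tokens)
    text_dir = F.normalize(text_feats[1] - text_feats[0], dim=-1)

    # Gradients must flow through the edited images back to the generator.
    img_feats = clip_model.encode_image(
        _to_clip_space(torch.cat([original, edited], dim=0)))
    n = original.shape[0]
    img_dir = F.normalize(img_feats[n:] - img_feats[:n], dim=-1)

    return (1.0 - (img_dir * text_dir).sum(dim=-1)).mean()
```

In a DiffusionCLIP-style setup, this loss (typically alongside an identity-preservation term) would be backpropagated through the sampled images to the diffusion model's parameters.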
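The noise combination idea for multi-attribute manipulation can likewise be sketched: at each reverse step, the noise predictions of several single-attribute fine-tuned diffusion models are blended with convex weights inside one deterministic DDIM update. The epsilon-model call signature `m(x_t, t)` and the helper below are assumptions for illustration under a standard DDIM formulation, not the paper's exact implementation.

```python
# Sketch: one deterministic DDIM step (eta = 0) using a weighted
# combination of noise predictions from attribute-specific models.
from typing import Sequence
import torch

@torch.no_grad()
def combined_ddim_step(models: Sequence[torch.nn.Module],
                       weights: Sequence[float],
                       x_t: torch.Tensor,
                       t: int,
                       alphas_cumprod: torch.Tensor) -> torch.Tensor:
    assert abs(sum(weights) - 1.0) < 1e-6, "weights should sum to 1"
    t_batch = torch.full((x_t.shape[0],), t,
                         device=x_t.device, dtype=torch.long)

    # Blend the noise estimates of the single-attribute models.
    eps = sum(w * m(x_t, t_batch) for m, w in zip(models, weights))

    a_t = alphas_cumprod[t]
    a_prev = (alphas_cumprod[t - 1] if t > 0
              else torch.tensor(1.0, device=x_t.device))

    # Predict x_0 from the blended noise, then step to t - 1.
    x0 = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
    return a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps
```

Because the weights form a convex combination, the blended prediction stays on the same scale as a single model's output, which is what makes multi-attribute edits composable in this reading.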