Large-scale text-to-image generative models have shown their remarkable ability to synthesize diverse and high-quality images. However, it is still challenging to directly apply these models for editing real images for two reasons. First, it is hard for users to come up with a perfect text prompt that accurately describes every visual detail in the input image. Second, while existing models can introduce desirable changes in certain regions, they often dramatically alter the input content and introduce unexpected changes in unwanted regions. In this work, we propose pix2pix-zero, an image-to-image translation method that can preserve the content of the original image without manual prompting. We first automatically discover editing directions that reflect desired edits in the text embedding space. To preserve the general content structure after editing, we further propose cross-attention guidance, which aims to retain the cross-attention maps of the input image throughout the diffusion process. In addition, our method does not need additional training for these edits and can directly use the existing pre-trained text-to-image diffusion model. We conduct extensive experiments and show that our method outperforms existing and concurrent works for both real and synthetic image editing.
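As a rough illustration of the first step (discovering an editing direction in the text embedding space), the following is a minimal sketch, not the authors' released implementation. It assumes the CLIP text encoder used by Stable Diffusion ("openai/clip-vit-large-patch14") and a tiny hand-written sentence bank standing in for the larger, automatically generated one described in the paper; the cat→dog concept pair and all sentences are illustrative.

```python
# Minimal sketch: estimate an editing direction between two concepts as the
# difference of mean CLIP text embeddings over a bank of sentences.
# Assumptions (not from the paper's code): the model name, the small
# hand-written sentence bank, and the cat -> dog concept pair.
import torch
from transformers import CLIPTextModel, CLIPTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to(device)

@torch.no_grad()
def mean_embedding(sentences):
    """Encode a list of sentences and average their per-token embeddings."""
    tokens = tokenizer(sentences, padding="max_length",
                       max_length=tokenizer.model_max_length,
                       truncation=True, return_tensors="pt").to(device)
    emb = text_encoder(**tokens).last_hidden_state   # (batch, seq_len, dim)
    return emb.mean(dim=0)                            # average over the sentence bank

# Small illustrative sentence banks; the paper uses many diverse sentences
# per concept (e.g., generated with an off-the-shelf language model).
contexts = ["sitting on a sofa", "in the garden", "looking at the camera"]
source_sentences = [f"a photo of a cat {c}" for c in contexts]
target_sentences = [f"a photo of a dog {c}" for c in contexts]

# The edit direction is the difference of the mean embeddings; at edit time
# it is added to the text embedding that conditions the diffusion model.
edit_direction = mean_embedding(target_sentences) - mean_embedding(source_sentences)
```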
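The cross-attention guidance idea can likewise be sketched with a toy stand-in for a single cross-attention layer rather than the full Stable Diffusion UNet. The tensor shapes, random inputs, guidance step size, and the omission of the actual DDIM update are simplifying assumptions; only the guidance mechanism itself, a gradient step on the noisy latent that keeps its cross-attention maps close to the reference maps recorded from the input image, mirrors the description above.

```python
# Toy sketch of cross-attention guidance with a stand-in attention layer.
# Reference maps come from reconstructing the input with the original prompt;
# the editing pass is conditioned on the edited embedding but nudged so its
# cross-attention maps stay close to those reference maps.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim, tokens, hw = 64, 77, 16 * 16     # feature dim, text tokens, flattened spatial size

# Toy projections standing in for one cross-attention layer of the UNet.
to_q = torch.nn.Linear(dim, dim, bias=False)
to_k = torch.nn.Linear(dim, dim, bias=False)

def cross_attention_maps(latent_feats, text_emb):
    """Return softmax(Q K^T / sqrt(d)) between image features and text tokens."""
    q = to_q(latent_feats)                               # (hw, dim)
    k = to_k(text_emb)                                   # (tokens, dim)
    return torch.softmax(q @ k.t() / dim ** 0.5, dim=-1) # (hw, tokens)

# Reference maps from the reconstruction pass with the original prompt.
with torch.no_grad():
    ref_latent = torch.randn(hw, dim)
    original_emb = torch.randn(tokens, dim)
    ref_maps = cross_attention_maps(ref_latent, original_emb)

# Editing pass: a stand-in edited embedding, guided toward the reference maps.
edited_emb = original_emb + 0.1 * torch.randn(tokens, dim)
latent = ref_latent.clone().requires_grad_(True)
guidance_lr = 0.1                                        # illustrative step size

for step in range(10):
    maps = cross_attention_maps(latent, edited_emb)
    loss = F.mse_loss(maps, ref_maps)                    # preserve input structure
    grad, = torch.autograd.grad(loss, latent)
    with torch.no_grad():
        latent -= guidance_lr * grad                     # gradient step on the latent
    # ...followed here, in the real sampler, by the usual denoising update on `latent`.
```

In the actual method this guidance step is interleaved with the diffusion sampler at every timestep, so the edited image inherits the spatial layout of the input while the edited text embedding drives the content change.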