Recent text-guided diffusion models provide powerful image generation capabilities. Currently, a massive effort is being invested in enabling the modification of these images using text alone, as an intuitive and versatile means of editing. To edit a real image with these state-of-the-art tools, one must first invert the image, together with a meaningful text prompt, into the pretrained model's domain. In this paper, we introduce an accurate inversion technique and thus facilitate intuitive text-based modification of the image. Our proposed inversion consists of two novel key components: (i) Pivotal inversion for diffusion models. While current methods aim at mapping random noise samples to a single input image, we use a single pivotal noise vector for each timestep and optimize around it. We demonstrate that a direct inversion is inadequate on its own, but it does provide a good anchor for our optimization. (ii) Null-text optimization, where we modify only the unconditional textual embedding that is used for classifier-free guidance, rather than the input text embedding. This keeps both the model weights and the conditional embedding intact, and hence enables prompt-based editing while avoiding cumbersome tuning of the model's weights. Our Null-text inversion, based on the publicly available Stable Diffusion model, is extensively evaluated on a variety of images and prompt edits, showing high-fidelity editing of real images.
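To make the two components concrete, the following is a minimal sketch written against the Hugging Face diffusers Stable Diffusion pipeline, not the authors' released code: all helper names, the Adam learning rate, and the constants (50 DDIM steps, guidance scale 7.5, 10 inner iterations per timestep, which only roughly follow the paper's settings) are our illustrative choices.

```python
# Sketch of (i) pivotal DDIM inversion and (ii) null-text optimization.
# Assumes Hugging Face `diffusers`; constants and helper names are illustrative.
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline, DDIMScheduler

NUM_STEPS, GUIDANCE, INNER_ITERS = 50, 7.5, 10

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
pipe.scheduler.set_timesteps(NUM_STEPS)
pipe.unet.requires_grad_(False)        # only the null embedding gets optimized
STEP = pipe.scheduler.config.num_train_timesteps // NUM_STEPS

def alpha(t):
    """alpha-bar at timestep t (scheduler's final value below t = 0)."""
    return (pipe.scheduler.alphas_cumprod[t] if t >= 0
            else pipe.scheduler.final_alpha_cumprod)

@torch.no_grad()
def ddim_invert(z0, cond_emb):
    """(i) Pivotal inversion: deterministically map the clean latent z0 up the
    noise levels, recording the whole trajectory as per-timestep pivots."""
    z, pivots = z0.clone(), [z0.clone()]
    for t in pipe.scheduler.timesteps.flip(0):       # ascending noise levels
        eps = pipe.unet(z, t, encoder_hidden_states=cond_emb).sample
        a_lo, a_hi = alpha(t - STEP), alpha(t)
        z = (a_hi.sqrt() * (z - (1 - a_lo).sqrt() * eps) / a_lo.sqrt()
             + (1 - a_hi).sqrt() * eps)
        pivots.append(z.clone())
    return pivots                                    # pivots[-1] ~ z_T

def guided_step(z, t, eps_cond, null_emb):
    """One classifier-free-guided DDIM denoising step, level t -> t - STEP."""
    eps_null = pipe.unet(z, t, encoder_hidden_states=null_emb).sample
    eps = eps_null + GUIDANCE * (eps_cond - eps_null)
    a_hi, a_lo = alpha(t), alpha(t - STEP)
    return (a_lo.sqrt() * (z - (1 - a_hi).sqrt() * eps) / a_hi.sqrt()
            + (1 - a_lo).sqrt() * eps)

def null_text_inversion(z0, cond_emb):
    """(ii) Null-text optimization: tune a per-timestep copy of the empty-prompt
    ("null") embedding so the guided trajectory tracks the pivots; the model
    weights and the conditional embedding stay frozen throughout."""
    pivots = ddim_invert(z0, cond_emb)
    ids = pipe.tokenizer("", padding="max_length",
                         max_length=pipe.tokenizer.model_max_length,
                         return_tensors="pt").input_ids.to(pipe.device)
    with torch.no_grad():
        null_emb = pipe.text_encoder(ids)[0]
    z, null_embs = pivots[-1], []
    for i, t in enumerate(pipe.scheduler.timesteps):   # descending: T -> 1
        target = pivots[len(pivots) - 2 - i]           # pivot one level below
        with torch.no_grad():                          # conditional eps is fixed
            eps_cond = pipe.unet(z, t, encoder_hidden_states=cond_emb).sample
        null_t = null_emb.clone().requires_grad_(True)
        opt = torch.optim.Adam([null_t], lr=1e-2)      # illustrative lr
        for _ in range(INNER_ITERS):
            loss = F.mse_loss(guided_step(z, t, eps_cond, null_t), target)
            opt.zero_grad(); loss.backward(); opt.step()
        null_embs.append(null_t.detach())
        with torch.no_grad():                          # advance with tuned null
            z = guided_step(z, t, eps_cond, null_t.detach())
    return z, null_embs
```

In this sketch, z0 would come from encoding the real image with the pipeline's VAE (scaled by pipe.vae.config.scaling_factor) and cond_emb from encoding the source prompt with the text encoder. The returned per-timestep null embeddings reproduce the input image under classifier-free guidance and can then be plugged into a prompt-based editing method such as Prompt-to-Prompt, precisely because the model weights and the conditional embedding were left untouched.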