Natural language offers a highly intuitive interface for image editing. In this paper, we introduce the first solution for performing local (region-based) edits in generic natural images, based on a natural language description along with an ROI mask. We achieve our goal by leveraging and combining a pretrained language-image model (CLIP), to steer the edit towards a user-provided text prompt, with a denoising diffusion probabilistic model (DDPM) to generate natural-looking results. To seamlessly fuse the edited region with the unchanged parts of the image, we spatially blend noised versions of the input image with the local text-guided diffusion latent at a progression of noise levels. In addition, we show that adding augmentations to the diffusion process mitigates adversarial results. We compare against several baselines and related methods, both qualitatively and quantitatively, and show that our method outperforms these solutions in terms of overall realism, the ability to preserve the background, and the ability to match the text prompt. Finally, we show several text-driven editing applications, including adding a new object to an image, removing/replacing/altering existing objects, background replacement, and image extrapolation.
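To make the blending step more concrete, the following is a minimal PyTorch-style sketch of a reverse-diffusion loop that combines CLIP guidance with the mask-based spatial blending described above. The callables `denoise_step`, `add_noise`, and `clip_grad`, and the `guidance_scale` value, are illustrative placeholders rather than the actual implementation; the guidance is applied in a simplified form.

```python
import torch

def blended_diffusion_edit(x0, mask, denoise_step, add_noise, clip_grad,
                           timesteps, guidance_scale=1000.0):
    """Sketch of mask-blended, CLIP-guided diffusion editing.

    x0           -- input image tensor (B, C, H, W), values in [-1, 1]
    mask         -- ROI mask tensor (B, 1, H, W); 1 inside the edited region
    denoise_step -- callable (x_t, t) -> one reverse-diffusion step of the DDPM
    add_noise    -- callable (x0, t) -> x0 noised to the noise level of step t
    clip_grad    -- callable (x_t, t) -> gradient of the CLIP text-image loss w.r.t. x_t
    timesteps    -- diffusion steps ordered from high noise to low noise
    """
    x_t = torch.randn_like(x0)  # the edited region starts from pure noise
    for t in timesteps:
        # steer the latent toward the text prompt (simplified CLIP guidance)
        x_t = x_t - guidance_scale * clip_grad(x_t, t)
        # one reverse-diffusion step on the guided latent (text-driven foreground)
        x_fg = denoise_step(x_t, t)
        # noise the original image to the *same* noise level as the latent
        x_bg = add_noise(x0, t)
        # spatial blend: edited content inside the mask, original content outside
        x_t = mask * x_fg + (1.0 - mask) * x_bg
    return x_t
```

Because the blend is repeated at every noise level, the foreground and background are fused while they are still noisy, which is what lets the final result transition seamlessly across the mask boundary.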