The tremendous progress in neural image generation, coupled with the emergence of seemingly omnipotent vision-language models, has finally enabled text-based interfaces for creating and editing images. Handling generic images requires a diverse underlying generative model, hence the latest works utilize diffusion models, which have been shown to surpass GANs in terms of diversity. One major drawback of diffusion models, however, is their relatively slow inference time. In this paper, we present an accelerated solution to the task of local text-driven editing of generic images, where the desired edits are confined to a user-provided mask. Our solution leverages a recent text-to-image Latent Diffusion Model (LDM), which speeds up diffusion by operating in a lower-dimensional latent space. We first convert the LDM into a local image editor by incorporating Blended Diffusion into it. Next, we propose an optimization-based solution to the inherent inability of this LDM to accurately reconstruct images. Finally, we address the scenario of performing local edits using thin masks. We evaluate our method against the available baselines both qualitatively and quantitatively and demonstrate that, in addition to being faster, our method achieves better precision than the baselines while mitigating some of their artifacts. The project page is available at https://omriavrahami.com/blended-latent-diffusion-page/
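For illustration, the core idea of blending in latent space can be summarized as follows: at each reverse-diffusion step, the latent being generated under the text prompt is merged with a noised version of the original image's latent, using the user mask downsampled to the latent resolution. The sketch below shows this single blending step; `denoise_step` and `add_noise` are hypothetical placeholders for the LDM's reverse step and forward-noising schedule, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F


def blended_latent_step(z_edit, z_source, mask, t, denoise_step, add_noise):
    """One blended denoising step in latent space (minimal sketch).

    z_edit:       current latent being denoised toward the text prompt
    z_source:     clean latent of the original (unedited) image
    mask:         binary edit mask in pixel resolution, shape (1, 1, H, W)
    t:            current diffusion timestep
    denoise_step: placeholder for one text-guided reverse-diffusion step
    add_noise:    placeholder that noises a clean latent to timestep t
    """
    # Denoise the edited latent one step (text guidance happens inside denoise_step).
    z_fg = denoise_step(z_edit, t)

    # Noise the original image's latent to the same timestep so both latents are comparable.
    z_bg = add_noise(z_source, t)

    # Downsample the pixel-space mask to the latent resolution (e.g. a factor of 8 for an LDM).
    m = F.interpolate(mask.float(), size=z_fg.shape[-2:], mode="nearest")

    # Blend: keep generated content inside the mask, preserve the original outside it.
    return z_fg * m + z_bg * (1.0 - m)
```

Repeating this step over all timesteps and then decoding the final latent yields an image whose edits are confined to the masked region; the optimization-based reconstruction and thin-mask handling mentioned above refine the result beyond this basic scheme.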