Text-guided image editing can have a transformative impact in supporting creative applications. A key challenge is to generate edits that are faithful to input text prompts, while consistent with input images. We present Imagen Editor, a cascaded diffusion model built by fine-tuning Imagen on text-guided image inpainting. Imagen Editor's edits are faithful to the text prompts, which is accomplished by using object detectors to propose inpainting masks during training. In addition, Imagen Editor captures fine details in the input image by conditioning the cascaded pipeline on the original high resolution image. To improve qualitative and quantitative evaluation, we introduce EditBench, a systematic benchmark for text-guided image inpainting. EditBench evaluates inpainting edits on natural and generated images exploring objects, attributes, and scenes. Through extensive human evaluation on EditBench, we find that object-masking during training leads to across-the-board improvements in text-image alignment -- such that Imagen Editor is preferred over DALL-E 2 and Stable Diffusion -- and, as a cohort, these models are better at object-rendering than text-rendering, and handle material/color/size attributes better than count/shape attributes.
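The object-masking idea mentioned above can be illustrated with a minimal sketch (not the authors' implementation): instead of masking a random region, a detector's proposed bounding boxes are used so the region to inpaint usually covers a salient object, which the text prompt then has to describe. The function name and fallback behavior here are illustrative assumptions.

```python
# Illustrative sketch (not the authors' code): object-masking for
# inpainting training. Given detector-proposed boxes for an image,
# pick one at random and build a binary inpainting mask.
import random
import numpy as np

def object_mask(height, width, boxes, rng=random.Random(0)):
    """Return a binary mask (1 = region to inpaint) from one random box.

    boxes: list of (x0, y0, x1, y1) detector proposals in pixels.
    Falls back to a random rectangle when no boxes are available.
    """
    mask = np.zeros((height, width), dtype=np.uint8)
    if boxes:
        x0, y0, x1, y1 = rng.choice(boxes)
    else:
        x0, y0 = rng.randrange(width // 2), rng.randrange(height // 2)
        x1, y1 = x0 + width // 4, y0 + height // 4
    mask[int(y0):int(y1), int(x0):int(x1)] = 1
    return mask

# During training, the model is conditioned on
# (image * (1 - mask), mask, prompt) and learns to reconstruct the
# masked object described by the text.
mask = object_mask(64, 64, [(8, 8, 32, 40)])
print(mask.sum())  # masked area = 24 * 32 = 768
```

Masking whole objects rather than random patches forces the model to rely on the text prompt for content, which is what drives the text-image alignment gains reported in the abstract.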