Text-guided image editing can have a transformative impact in supporting creative applications. A key challenge is to generate edits that are faithful to input text prompts while remaining consistent with input images. We present Imagen Editor, a cascaded diffusion model built by fine-tuning Imagen for text-guided image inpainting. Imagen Editor's edits are faithful to the text prompts, which is accomplished by using object detectors to propose inpainting masks during training. In addition, Imagen Editor captures fine details in the input image by conditioning the cascaded pipeline on the original high-resolution image. To improve qualitative and quantitative evaluation, we introduce EditBench, a systematic benchmark for text-guided image inpainting. EditBench evaluates inpainting edits on natural and generated images, exploring objects, attributes, and scenes. Through extensive human evaluation on EditBench, we find that object-masking during training leads to across-the-board improvements in text-image alignment, such that Imagen Editor is preferred over DALL-E 2 and Stable Diffusion. As a cohort, these models are better at object-rendering than text-rendering, and handle material/color/size attributes better than count/shape attributes.
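To make the object-masking idea concrete, the following is a minimal sketch, not the paper's implementation: it assumes an off-the-shelf torchvision Faster R-CNN as the detector, and the score threshold and random-box fallback are illustrative choices rather than details reported by the authors.

```python
import torch
import torchvision

# Any object detector that proposes candidate regions would do; the paper
# does not prescribe this particular model.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

def propose_object_mask(image: torch.Tensor, score_thresh: float = 0.7) -> torch.Tensor:
    """Propose a binary inpainting mask (1 = region to repaint) for training.

    image: float tensor of shape (3, H, W) with values in [0, 1].
    """
    _, h, w = image.shape
    with torch.no_grad():
        preds = detector([image])[0]  # detections come sorted by descending score
    keep = preds["scores"] > score_thresh
    mask = torch.zeros(1, h, w)
    if keep.any():
        # Mask the bounding box of the highest-scoring detection so the model
        # must repaint a whole object rather than an arbitrary patch.
        x1, y1, x2, y2 = preds["boxes"][keep][0].int().tolist()
        mask[:, y1:y2, x1:x2] = 1.0
    else:
        # Assumed fallback: a random box, as in conventional inpainting setups.
        y1 = torch.randint(0, h // 2, (1,)).item()
        x1 = torch.randint(0, w // 2, (1,)).item()
        mask[:, y1:y1 + h // 2, x1:x1 + w // 2] = 1.0
    return mask

# A training example pairs the masked image and mask with the text prompt;
# the model learns to fill the masked region to match the prompt.
image = torch.rand(3, 256, 256)
mask = propose_object_mask(image)
masked_image = image * (1.0 - mask)
```

Masking whole detected objects rather than random patches forces the model to synthesize complete objects described by the prompt, which is consistent with the abstract's finding that object-masking improves text-image alignment across the board.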