Image inpainting is the task of erasing unwanted pixels from an image and filling them in a semantically consistent and realistic way. Traditionally, the pixels to be erased are specified with binary masks. In practice, this means a user must create a mask for each object they want to remove, which can be time-consuming and error-prone. In this work, we are interested in an image inpainting algorithm that both determines which object to remove from natural language input and removes it, in a single step. To this end, we first construct a dataset for this task, named GQA-Inpaint, which will be released soon. Second, we present a novel inpainting framework, Inst-Inpaint, that removes objects from images based on instructions given as text prompts. We set up various GAN-based and diffusion-based baselines and run experiments on synthetic and real image datasets. We compare the methods using several evaluation metrics that measure the quality and accuracy of the models, and we show significant quantitative and qualitative improvements.
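To make the traditional mask-based setup concrete, the following is a minimal sketch of how a binary mask marks the pixels to erase before an inpainting model fills them in. It is not taken from the Inst-Inpaint code: the helper name `erase_masked_pixels`, the toy image, and the convention that 1 means "erase" are assumptions made for illustration only.

```python
import numpy as np


def erase_masked_pixels(image, mask, fill_value=0):
    """Return a copy of `image` with pixels where `mask` is 1 set to `fill_value`.

    Hypothetical helper: `image` has shape (H, W, C) and `mask` has shape (H, W).
    The "1 means erase" convention is an assumption, not the paper's definition.
    """
    if image.shape[:2] != mask.shape:
        raise ValueError("mask must match the image's height and width")
    erased = image.copy()
    # A 2D boolean mask selects whole pixels, so all channels are overwritten.
    erased[mask.astype(bool)] = fill_value
    return erased


# Toy 4x4 RGB image filled with the value 200.
image = np.full((4, 4, 3), 200, dtype=np.uint8)

# Hypothetical binary mask: 1 marks the 2x2 region the user wants removed.
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 1

# The erased image (together with the mask) is what a mask-based
# inpainting model would typically receive as input.
erased = erase_masked_pixels(image, mask)
```

In an instruction-based setting such as the one described above, the user would supply a text prompt instead of constructing `mask` by hand.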