Generic image inpainting aims to complete a corrupted image by borrowing surrounding information, which barely generates novel content. By contrast, multi-modal inpainting provides more flexible and useful controls on the inpainted content, \eg, a text prompt can be used to describe an object with richer attributes, and a mask can be used to constrain the shape of the inpainted object rather than being treated only as a missing area. We propose a new diffusion-based model named SmartBrush for completing a missing region with an object using both text and shape guidance. While previous work such as DALLE-2 and Stable Diffusion can perform text-guided inpainting, they do not support shape guidance and tend to modify the background texture surrounding the generated object. Our model incorporates both text and shape guidance with precision control. To better preserve the background, we propose a novel training and sampling strategy that augments the diffusion U-Net with object-mask prediction. Lastly, we introduce a multi-task training strategy that jointly trains inpainting with text-to-image generation to leverage more training data. We conduct extensive experiments showing that our model outperforms all baselines in terms of visual quality, mask controllability, and background preservation.
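To make the training objective concrete, below is a minimal sketch of one training step combining the two losses the abstract describes: the standard diffusion denoising loss, conditioned on the text prompt and the shape mask, plus an auxiliary object-mask prediction loss that lets the model localize the generated object so the background can be copied back at sampling time. The names (`unet`, `text_encoder`), the concatenation-based mask conditioning, and the loss weighting are illustrative assumptions, not the paper's released code.

\begin{verbatim}
import torch
import torch.nn.functional as F

def training_step(unet, text_encoder, x0, prompt_ids,
                  shape_mask, object_mask, alphas_cumprod):
    """One hypothetical SmartBrush-style training step.

    x0:          clean (latent) image batch, (B, C, H, W)
    shape_mask:  user-provided shape guidance, (B, 1, H, W)
    object_mask: ground-truth object mask used as the auxiliary target
    """
    b = x0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=x0.device)
    noise = torch.randn_like(x0)
    a_t = alphas_cumprod[t].view(b, 1, 1, 1)

    # Standard forward diffusion: x_t = sqrt(a_t) * x0 + sqrt(1 - a_t) * eps
    x_t = a_t.sqrt() * x0 + (1.0 - a_t).sqrt() * noise

    text_emb = text_encoder(prompt_ids)

    # Condition on the shape mask by concatenating it to the noisy input.
    # The augmented U-Net returns both the noise estimate and a soft
    # object-mask prediction (an assumed two-headed interface).
    eps_pred, mask_logits = unet(torch.cat([x_t, shape_mask], dim=1),
                                 t, text_emb)

    denoise_loss = F.mse_loss(eps_pred, noise)
    # Auxiliary loss: the predicted mask should match the true object
    # mask, so unchanged background can be blended back during sampling.
    mask_loss = F.binary_cross_entropy_with_logits(mask_logits, object_mask)
    return denoise_loss + mask_loss
\end{verbatim}

At sampling time, the same predicted mask can be used to blend the denoised foreground with the known background at each step, which is one plausible reading of how the strategy preserves surrounding texture.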