When manipulating an object, existing text-to-image diffusion models often ignore the shape of the object and generate content that is incorrectly scaled, cut off, or replaced with background content. We propose a training-free method, Shape-Guided Diffusion, that modifies pretrained diffusion models to be sensitive to shape input specified by a user or automatically inferred from text. We use a novel Inside-Outside Attention mechanism during the inversion and generation process to apply this shape constraint to the cross- and self-attention maps. Our mechanism designates which spatial region is the object (inside) vs. the background (outside), then associates edits specified by text prompts with the correct region. We demonstrate the efficacy of our method on the shape-guided editing task, where the model must replace an object according to a text prompt and object mask. We curate a new ShapePrompts benchmark derived from MS-COCO and achieve SOTA results in shape faithfulness without a degradation in text alignment or image realism, according to both automatic metrics and annotator ratings. Our data and code will be made available at https://shape-guided-diffusion.github.io.
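To make the Inside-Outside Attention idea concrete, below is a minimal sketch (not the paper's implementation) of how a binary object mask might constrain a cross-attention map in a diffusion U-Net. The function name, tensor shapes, token index arguments, and the renormalization step are assumptions for illustration only.

```python
import torch

def inside_outside_attention(attn_probs, object_mask, inside_token_ids, outside_token_ids):
    """Hypothetical sketch: constrain cross-attention with an object mask.

    attn_probs:        (batch, num_pixels, num_tokens) cross-attention probabilities
    object_mask:       (num_pixels,) binary mask, 1 = inside (object), 0 = outside (background)
    inside_token_ids:  indices of prompt tokens describing the edited object
    outside_token_ids: indices of prompt tokens describing the background
    """
    constrained = attn_probs.clone()
    inside = object_mask.view(1, -1, 1).to(attn_probs.dtype)  # broadcast over batch and tokens

    # Restrict object-token attention to pixels inside the mask ...
    constrained[:, :, inside_token_ids] *= inside
    # ... and background-token attention to pixels outside the mask.
    constrained[:, :, outside_token_ids] *= (1.0 - inside)

    # Renormalize so each pixel's attention over tokens still sums to 1 (assumed step).
    constrained = constrained / constrained.sum(dim=-1, keepdim=True).clamp(min=1e-8)
    return constrained
```

In this sketch the same masking would be applied during both inversion and generation, so edits named in the prompt are tied to the inside region while the outside region remains governed by background tokens; an analogous mask could be applied to self-attention maps.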