Text-to-image models give rise to workflows that often begin with an exploration step, in which users sift through a large collection of generated images. The global nature of the text-to-image generation process prevents users from narrowing their exploration to a particular object in the image. In this paper, we present a technique to generate a collection of images that depicts variations in the shape of a specific object, enabling an object-level shape exploration process. Creating plausible variations is challenging, as it requires control over the shape of the generated object while respecting its semantics. A particular challenge when generating object variations is accurately localizing the manipulation applied to the object's shape. We introduce a prompt-mixing technique that switches between prompts along the denoising process to attain a variety of shape choices. To localize the image-space operation, we present two techniques that use the self-attention layers in conjunction with the cross-attention layers. Moreover, we show that these localization techniques are general and effective beyond the scope of generating object variations. Extensive results and comparisons demonstrate the effectiveness of our method in generating object variations, and the competence of our localization techniques.
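As a rough illustration of the prompt-switching idea described above (a minimal sketch, not the authors' implementation), a denoising loop can choose which prompt embedding conditions each step based on the current timestep. The switch fraction, helper names, and the mock denoising step below are all hypothetical placeholders.

```python
import numpy as np

def select_prompt(step, total_steps, prompt_embs, switch_fraction=0.3):
    # Hypothetical schedule: condition the early (high-noise) steps on the
    # first prompt, then switch to the second prompt for the remaining steps.
    if step < switch_fraction * total_steps:
        return prompt_embs[0]
    return prompt_embs[1]

def mock_denoise(latent, prompt_emb):
    # Stand-in for one diffusion denoising step conditioned on a prompt
    # embedding (a real pipeline would call the denoising U-Net here).
    return latent * 0.9 + prompt_emb * 0.1

total_steps = 50
prompt_embs = [np.ones(4), np.full(4, 2.0)]  # toy "embeddings"
latent = np.zeros(4)
used = []  # records which prompt conditioned each step
for step in range(total_steps):
    emb = select_prompt(step, total_steps, prompt_embs)
    used.append(0 if emb is prompt_embs[0] else 1)
    latent = mock_denoise(latent, emb)
```

With a switch fraction of 0.3 and 50 steps, the first 15 steps are conditioned on the first prompt and the remaining 35 on the second; varying the switch point (or the second prompt) yields different shape choices.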