Recent large-scale text-driven synthesis models have attracted much attention thanks to their remarkable capability of generating highly diverse images that follow given text prompts. Such text-based synthesis methods are particularly appealing to humans, who are used to verbally describing their intent. It is therefore natural to extend text-driven image synthesis to text-driven image editing. Editing is challenging for these generative models, since an innate property of an editing technique is to preserve most of the original image, while in text-based models even a small modification of the text prompt often leads to a completely different outcome. State-of-the-art methods mitigate this by requiring the user to provide a spatial mask that localizes the edit, thereby ignoring the original structure and content within the masked region. In this paper, we pursue an intuitive prompt-to-prompt editing framework, where the edits are controlled by text only. To this end, we analyze a text-conditioned model in depth and observe that the cross-attention layers are the key to controlling the relation between the spatial layout of the image and each word in the prompt. With this observation, we present several applications that control the image synthesis by editing the textual prompt only. These include localized editing by replacing a word, global editing by adding a specification, and even delicately controlling the extent to which a word is reflected in the image. We present our results over diverse images and prompts, demonstrating high-quality synthesis and fidelity to the edited prompts.
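The central observation above is that cross-attention ties each prompt token to spatial locations in the generated image, and that reusing the attention maps of the original prompt while denoising with an edited prompt preserves the layout. The sketch below is not the authors' implementation; it is a minimal, single-head illustration in which all shapes, the projection matrices, and the map-injection rule are illustrative assumptions.

```python
# Minimal sketch of cross-attention between image features (queries) and prompt
# token embeddings (keys/values), with an optional attention-map override that
# mimics the prompt-to-prompt idea of injecting the source prompt's maps.
import torch
import torch.nn.functional as F

def cross_attention(image_feats, token_embeds, w_q, w_k, w_v, attn_override=None):
    """Single-head cross-attention (illustrative shapes).

    image_feats:   (n_pixels, d_model)  spatial features inside the denoiser
    token_embeds:  (n_tokens, d_text)   text-encoder embeddings of the prompt
    attn_override: optional (n_pixels, n_tokens) map injected in place of the
                   freshly computed one, so the source layout is kept.
    """
    q = image_feats @ w_q                                      # (n_pixels, d_k)
    k = token_embeds @ w_k                                     # (n_tokens, d_k)
    v = token_embeds @ w_v                                     # (n_tokens, d_model)
    attn = F.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)     # (n_pixels, n_tokens)
    if attn_override is not None:
        attn = attn_override                                   # reuse source-prompt layout
    return attn @ v, attn

# Toy usage: compute the maps for a "source" prompt, then reuse them with an
# "edited" prompt so only the token content changes, not where it appears.
torch.manual_seed(0)
d_model, d_text, d_k, n_pixels, n_tokens = 64, 32, 48, 256, 8
w_q = torch.randn(d_model, d_k)
w_k = torch.randn(d_text, d_k)
w_v = torch.randn(d_text, d_model)
image_feats   = torch.randn(n_pixels, d_model)
source_tokens = torch.randn(n_tokens, d_text)   # e.g. embeddings of "a photo of a cat"
edited_tokens = torch.randn(n_tokens, d_text)   # e.g. embeddings of "a photo of a dog"

_, source_attn = cross_attention(image_feats, source_tokens, w_q, w_k, w_v)
edited_out, _  = cross_attention(image_feats, edited_tokens, w_q, w_k, w_v,
                                 attn_override=source_attn)
print(edited_out.shape)  # (256, 64): edited content rendered into the source layout
```

In an actual diffusion pipeline this substitution would happen at every cross-attention layer and denoising step; the single call above is only meant to make the layout-versus-content separation concrete.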