SpaText: Spatio-Textual Representation for Controllable Image Generation

Recent text-to-image diffusion models are able to generate convincing results of unprecedented quality. However, it is nearly impossible to control the shapes of different regions or objects, or their layout, in a fine-grained fashion. Previous attempts to provide such controls were hindered by their reliance on a fixed set of labels. To this end, we present SpaText, a new method for text-to-image generation using open-vocabulary scene control. In addition to a global text prompt that describes the entire scene, the user provides a segmentation map in which each region of interest is annotated with a free-form natural language description. Because no large-scale datasets provide a detailed textual description for each region in the image, we leverage the current large-scale text-to-image datasets and base our approach on a novel CLIP-based spatio-textual representation. We show its effectiveness on two state-of-the-art diffusion models, one pixel-based and one latent-based. In addition, we show how to extend the classifier-free guidance method in diffusion models to the multi-conditional case and present an alternative accelerated inference algorithm. Finally, we offer several automatic evaluation metrics and use them, in addition to FID scores and a user study, to evaluate our method and show that it achieves state-of-the-art results on image generation with free-form textual scene control.
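
The two ideas named in the abstract, a per-region spatio-textual map and classifier-free guidance with several conditions, can be illustrated with a minimal sketch. This is not the paper's implementation. The function names, the plain-list data layout, the all-zero vector for unannotated pixels, and the toy embeddings are assumptions made for illustration; in the actual method the region embeddings would come from CLIP and the noise predictions from a diffusion model. The additive combination of per-condition guidance terms is one common way to extend classifier-free guidance to multiple conditions and may differ in detail from the paper's formulation.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def build_spatio_textual_map(
    segment_ids: Sequence[Sequence[int]],
    region_embeddings: Dict[int, Vector],
    dim: int,
) -> List[List[Vector]]:
    """Place each region's embedding at every pixel of that region.

    segment_ids[y][x] is a region id. Ids missing from region_embeddings
    (for example unannotated background) receive an all-zero vector.
    The embeddings are stand-ins for CLIP vectors (an assumption).
    """
    for rid, emb in region_embeddings.items():
        if len(emb) != dim:
            raise ValueError(f"embedding for region {rid} has wrong length")
    zero = [0.0] * dim
    return [
        [list(region_embeddings.get(rid, zero)) for rid in row]
        for row in segment_ids
    ]


def multi_conditional_cfg(
    eps_uncond: Vector,
    eps_conds: Sequence[Vector],
    scales: Sequence[float],
) -> Vector:
    """Combine noise predictions as
    eps_uncond + sum_i scales[i] * (eps_conds[i] - eps_uncond), elementwise.

    eps_conds[i] is the prediction with only condition i active
    (hypothetical inputs; a real model would produce them).
    """
    if len(eps_conds) != len(scales):
        raise ValueError("need one guidance scale per condition")
    guided = list(eps_uncond)
    for eps_c, scale in zip(eps_conds, scales):
        for k, (c, u) in enumerate(zip(eps_c, eps_uncond)):
            guided[k] += scale * (c - u)
    return guided
```

With a single condition and scale s, the second function reduces to the standard classifier-free guidance update u + s(c - u). The separate scales let the global prompt and the spatio-textual map be weighted independently, at the cost of one extra model evaluation per condition.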
