Plain text has become a prevalent interface for text-to-image synthesis. However, its limited customization options hinder users from accurately describing desired outputs. For example, plain text makes it hard to specify continuous quantities, such as a precise RGB color value or the importance of each word. Furthermore, detailed text prompts for complex scenes are tedious for humans to write and challenging for text encoders to interpret. To address these challenges, we propose using a rich-text editor supporting formats such as font style, size, color, and footnotes. We extract each word's attributes from the rich text to enable local style control, explicit token reweighting, precise color rendering, and detailed region synthesis. We achieve these capabilities through a region-based diffusion process. We first obtain each word's region from the cross-attention maps of a vanilla diffusion process on the plain text. For each region, we then enforce its text attributes by creating region-specific detailed prompts and applying region-specific guidance. We present various examples of image generation from rich text and demonstrate that our method outperforms strong baselines in quantitative evaluations.
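To make the region-extraction step concrete, the sketch below illustrates one plausible reading of it: cross-attention maps from a plain-text diffusion pass are averaged, and each spatial location is assigned to the token attending to it most strongly. This is a minimal illustration under stated assumptions, not the paper's implementation; the function name `token_region_masks`, the tensor layout, and the hard argmax assignment rule are all assumptions.

```python
import torch

def token_region_masks(attn_maps: torch.Tensor) -> torch.Tensor:
    """Derive one binary region mask per token from cross-attention maps.

    attn_maps: (steps, heads, H*W, num_tokens) cross-attention scores
    collected from a vanilla diffusion pass on the plain-text prompt.
    Returns: (num_tokens, H, W) binary masks, one region per token.
    """
    # Average attention over diffusion steps and heads.
    avg = attn_maps.mean(dim=(0, 1))            # (H*W, num_tokens)
    hw = avg.shape[0]
    h = w = int(hw ** 0.5)                      # assume a square latent grid
    # Assign each spatial location to the token with the largest score.
    labels = avg.argmax(dim=-1)                 # (H*W,)
    masks = torch.stack([(labels == t).float()
                         for t in range(avg.shape[-1])])
    return masks.view(-1, h, w)                 # (num_tokens, H, W)
```

In practice, per-token thresholding or smoothing of the attention maps may be preferable to a hard argmax, which can yield noisy region boundaries.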