GlyphDraw: Learning to Draw Chinese Characters in Image Synthesis Models Coherently

Recent breakthroughs in language-guided image generation have enabled the creation of high-quality and diverse images from user instructions. Despite this impressive synthesis performance, one significant limitation of current image generation models is their limited ability to generate coherent text within images, particularly for complex glyph structures such as Chinese characters. To address this problem, we introduce GlyphDraw, a general learning framework that aims to endow image generation models with the capacity to generate images embedded with coherent text. To the best of our knowledge, this is the first work in the field of image synthesis to address the generation of Chinese characters. We first carefully design a construction strategy for the image-text dataset, then build our model on a diffusion-based image generator and modify its network structure so that the model learns to draw Chinese characters with the help of glyph and position information. Furthermore, we maintain the model's open-domain image synthesis capability by using a variety of training techniques to prevent catastrophic forgetting. Extensive qualitative and quantitative experiments demonstrate that our method not only produces accurate Chinese characters as specified in prompts, but also naturally blends the generated text into the background. Please refer to https://1073521013.github.io/glyph-draw.github.io
