We introduce a novel method that automatically generates artistic typography by stylizing one or more letter fonts to visually convey the semantics of an input word, while ensuring that the output remains legible. Our task poses several challenges: conflicting goals (artistic stylization vs. legibility), a lack of ground truth, and an immense search space. To address them, our approach leverages large language models to bridge text and visual imagery for stylization, and builds an unsupervised generative model on a diffusion-model backbone. Specifically, we employ the denoising generator of a Latent Diffusion Model (LDM), with the key addition of a CNN-based discriminator that adapts the input style onto the input text. The discriminator treats rasterized images of a given letter/word in the chosen font as real samples and the output of the denoising generator as fake samples. We coin our model DS-Fusion, for discriminated and stylized diffusion. We demonstrate the quality and versatility of our method through numerous examples, qualitative and quantitative evaluations, and ablation studies. User studies comparing against strong baselines, including CLIPDraw and DALL-E 2, as well as artist-crafted typographies, demonstrate the strong performance of DS-Fusion.
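The adversarial component described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the network architecture, image resolution, and loss formulation (a standard non-saturating GAN loss) are assumptions; only the real/fake pairing — rasterized glyph images vs. denoising-generator output — follows the description in the abstract.

```python
import torch
import torch.nn as nn

class GlyphDiscriminator(nn.Module):
    """CNN discriminator sketch: layer sizes are illustrative assumptions."""
    def __init__(self, in_ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, 4, stride=2, padding=1),  # patch-wise real/fake logits
        )

    def forward(self, x):
        return self.net(x)

def discriminator_loss(disc, real_glyphs, fake_outputs):
    """Real samples: rasterized letter/word images in the chosen font.
    Fake samples: images produced by the denoising generator.
    Uses a standard BCE GAN loss (an assumption, not the paper's exact loss)."""
    bce = nn.BCEWithLogitsLoss()
    real_logits = disc(real_glyphs)
    fake_logits = disc(fake_outputs.detach())  # do not backprop into the generator here
    return (bce(real_logits, torch.ones_like(real_logits)) +
            bce(fake_logits, torch.zeros_like(fake_logits)))

if __name__ == "__main__":
    disc = GlyphDiscriminator()
    real = torch.rand(4, 3, 64, 64)  # stand-in for rasterized glyph images
    fake = torch.rand(4, 3, 64, 64)  # stand-in for denoising-generator output
    loss = discriminator_loss(disc, real, fake)
    print(loss.item())
```

In the full pipeline, this discriminator loss would be combined with the diffusion denoising objective so that the generator is pushed toward outputs that both match the semantic style and remain recognizable as the input glyphs.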