Scene text editing is a challenging task that involves modifying or inserting specified text in an image while maintaining its natural and realistic appearance. Most previous approaches to this task rely on style-transfer models that crop out text regions and feed them into image transfer models, such as GANs. However, these methods are limited in their ability to change text style and cannot insert new text into images. Recent advances in diffusion models show promise in overcoming these limitations through text-conditional image editing. However, our empirical analysis reveals that state-of-the-art diffusion models struggle with rendering correct text and controlling text style. To address these problems, we propose DiffSTE, which improves pre-trained diffusion models with a dual-encoder design: a character encoder for better text legibility and an instruction encoder for better style control. We introduce an instruction-tuning framework that trains our model to map a text instruction to the corresponding image, rendered in either the specified style or the style of the surrounding text in the background. This training method further gives our model zero-shot generalization to the following three scenarios: generating text with unseen font variations, e.g., italic and bold; mixing different fonts to construct a new font; and using more relaxed forms of natural language as instructions to guide the generation task. We evaluate our approach on five datasets and demonstrate its superior performance in terms of text correctness, image naturalness, and style controllability. Our code is publicly available at https://github.com/UCSB-NLP-Chang/DiffSTE.