Text images are a unique and crucial information medium that integrates visual aesthetics and linguistic semantics in modern e-society. Owing to their subtlety and complexity, generating text images remains a challenging and evolving frontier in image generation. The recent surge of specialized image generators (\emph{e.g.}, Flux-series) and unified generative models (\emph{e.g.}, GPT-4o), which demonstrate exceptional fidelity, raises a natural question: can they master the intricacies of text image generation and editing? Motivated by this question, we assess the capabilities of current state-of-the-art generative models in text image generation and editing. We incorporate various typical optical character recognition (OCR) tasks into our evaluation and broaden the concept of text-based generation tasks into OCR generative tasks. We select 33 representative tasks and group them into five categories: document, handwritten text, scene text, artistic text, and complex \& layout-rich text. For a comprehensive evaluation, we examine six models spanning both closed-source and open-source domains, using tailored, high-quality image inputs and prompts. From this evaluation, we draw key observations and identify the weaknesses of current generative models on OCR tasks. We argue that photorealistic text image generation and editing should be internalized as foundational skills of general-domain generative models rather than delegated to specialized solutions, and we hope this empirical analysis provides valuable insights for the community toward this goal. The evaluation is available online and will be continuously updated at our GitHub repository.