Aesthetics is Cheap, Show me the Text: An Empirical Evaluation of State-of-the-Art Generative Models for OCR

Text images are a unique and crucial information medium, integrating visual aesthetics and linguistic semantics in modern e-society. Owing to their subtlety and complexity, generating text images remains a challenging and evolving frontier in image generation. The recent surge of specialized image generators (e.g., the Flux series) and unified generative models (e.g., GPT-4o), which demonstrate exceptional fidelity, raises a natural question: can they master the intricacies of text image generation and editing? Motivated by this question, we assess the capabilities of current state-of-the-art generative models in text image generation and editing. We incorporate a range of typical optical character recognition (OCR) tasks into our evaluation and broaden the concept of text-based generation tasks into OCR generative tasks. We select 33 representative tasks and group them into five categories: document, handwritten text, scene text, artistic text, and complex & layout-rich text. For a comprehensive evaluation, we examine six models spanning both closed-source and open-source domains, using tailored, high-quality image inputs and prompts. From this evaluation, we draw crucial observations and identify the weaknesses of current generative models on OCR tasks. We argue that photorealistic text image generation and editing should be internalized as foundational skills of general-domain generative models rather than delegated to specialized solutions, and we hope this empirical analysis provides valuable insights for the community toward that goal. This evaluation is online and will be continuously updated at our GitHub repository.

