Understanding document images (e.g., invoices) is a core but challenging task, since it requires complex functions such as reading text and understanding the document holistically. Current Visual Document Understanding (VDU) methods outsource text reading to off-the-shelf Optical Character Recognition (OCR) engines and focus on the understanding task using the OCR outputs. Although such OCR-based approaches have shown promising performance, they suffer from 1) the high computational cost of running OCR; 2) the inflexibility of OCR models across languages and document types; and 3) the propagation of OCR errors to subsequent processing. To address these issues, we introduce a novel OCR-free VDU model named Donut, which stands for Document understanding transformer. As a first step in OCR-free VDU research, we propose a simple architecture (i.e., Transformer) with a pre-training objective (i.e., cross-entropy loss). Donut is conceptually simple yet effective. Through extensive experiments and analyses, we show that Donut achieves state-of-the-art performance on various VDU tasks in terms of both speed and accuracy. In addition, we offer a synthetic data generator that makes model pre-training flexible across languages and domains. The code, trained model, and synthetic data are available at https://github.com/clovaai/donut.