Extending TrOCR for Text Localization-Free OCR of Full-Page Scanned Receipt Images

Digitization of scanned receipts aims to extract text from receipt images and save it into structured documents. This is usually split into two sub-tasks: text localization and optical character recognition (OCR). Most existing OCR models focus only on cropped text instance images, which require bounding box information from a text region detection model. Introducing an additional detector to identify text instances in advance is inefficient; however, instance-level OCR models have very low accuracy when applied to a whole image for document-level OCR, for example a receipt image containing multiple text lines arranged in various layouts. To this end, we propose a localization-free document-level OCR model that transcribes all the characters in a receipt image into an ordered sequence end-to-end. Specifically, we finetune the pretrained Transformer-based instance-level model TrOCR with randomly cropped image chunks, and gradually increase the image chunk size to generalize the recognition ability from instance images to full-page images. In our experiments on the SROIE receipt OCR dataset, the model finetuned with our strategy achieved a 64.4 F1-score and a 22.8% character error rate (CER) on the word-level and character-level metrics, respectively, outperforming the baseline results of 48.5 F1-score and 50.6% CER. The best model, which splits the full image into 15 equally sized chunks, gives an 87.8 F1-score and 4.98% CER with minimal additional pre- or post-processing of the output. Moreover, the characters in the generated document-level sequences are arranged in reading order, which is practical for real-world applications.
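
For readers unfamiliar with the character-level metric, the following is a minimal sketch of how a character error rate could be computed. It assumes the common definition (Levenshtein edit distance between the predicted and reference strings, divided by the reference length); the abstract does not state the exact normalization or text preprocessing used in the experiments, and the function names `edit_distance` and `cer` are illustrative only.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning ref into hyp."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from ref
                curr[j - 1] + 1,     # insert h into ref
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate, assumed here to be the edit distance
    normalized by the reference length."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under this assumed definition, a CER of 4.98% would correspond to roughly five character edits per hundred reference characters across the transcribed page.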
