Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents. Optical character recognition (OCR) can be used to produce digitized text, and previous work has demonstrated the utility of neural post-correction methods that improve the results of general-purpose OCR systems on recognition of less-well-resourced languages. However, these methods rely on manually curated post-correction data, which are relatively scarce compared to the non-annotated raw images that need to be digitized. In this paper, we present a semi-supervised learning method that makes it possible to utilize these raw images to improve performance, specifically through the use of self-training, a technique where a model is iteratively trained on its own outputs. In addition, to enforce consistency in the recognized vocabulary, we introduce a lexically-aware decoding method that augments the neural post-correction model with a count-based language model constructed from the recognized texts, implemented using weighted finite-state automata (WFSA) for efficient and effective decoding. Results on four endangered languages demonstrate the utility of the proposed method, with relative error reductions of 15-29%, where we find the combination of self-training and lexically-aware decoding essential for achieving consistent improvements. Data and code are available at https://shrutirij.github.io/ocr-el/.
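To make the self-training component concrete, below is a minimal sketch of the iterative loop the abstract describes: a post-correction model is first trained on the manually curated pairs, then repeatedly retrained on its own predictions over the raw, unannotated OCR outputs. All names here (`EchoModel`, `self_train`, the `.train()`/`.predict()` interface) are illustrative stand-ins, not the paper's actual API.

```python
class EchoModel:
    """Trivial stand-in for a seq2seq post-correction model: 'trains' by
    doing nothing and 'predicts' its input unchanged. Replace with a real
    neural post-corrector."""

    def train(self, pairs: list[tuple[str, str]]) -> None:
        pass

    def predict(self, src: str) -> str:
        return src


def self_train(model, labeled, unlabeled, rounds: int = 3):
    """Iteratively retrain the model on its own pseudo-labeled outputs.

    labeled:   (first-pass OCR output, gold transcription) pairs
    unlabeled: first-pass OCR outputs with no gold transcription
    """
    model.train(labeled)
    for _ in range(rounds):
        # Pseudo-label the raw OCR outputs with the current model.
        pseudo = [(src, model.predict(src)) for src in unlabeled]
        # Retrain on the curated pairs plus the pseudo-labeled pairs.
        model.train(labeled + pseudo)
    return model


model = self_train(EchoModel(),
                   labeled=[("0CR text", "OCR text")],
                   unlabeled=["raw page text"])
```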
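The lexically-aware decoding component can be sketched in a similar spirit: build a count-based language model from the texts recognized so far, then combine its score with the neural model's score when choosing among hypotheses. The paper encodes the count-based model as a weighted finite-state automaton (WFSA) and integrates it directly into decoding; the sketch below substitutes a plain dictionary and simple n-best rescoring, and names such as `lexical_logprob`, `lam`, and `alpha` are assumptions for illustration.

```python
import math
from collections import Counter


def build_counts(recognized_texts: list[str]) -> Counter:
    """Word counts over the texts recognized so far."""
    counts = Counter()
    for text in recognized_texts:
        counts.update(text.split())
    return counts


def lexical_logprob(hyp: str, counts: Counter, alpha: float = 1.0) -> float:
    """Add-alpha smoothed unigram log-probability of a hypothesis."""
    total = sum(counts.values()) + alpha * (len(counts) + 1)
    return sum(math.log((counts[w] + alpha) / total) for w in hyp.split())


def rescore(hyps: list[tuple[str, float]], counts: Counter,
            lam: float = 0.5) -> str:
    """Pick the hypothesis that maximizes an interpolation of the neural
    model's log-probability and the count-based lexical score, rewarding
    outputs consistent with the vocabulary seen in recognized text."""
    return max(
        hyps,
        key=lambda h: (1 - lam) * h[1] + lam * lexical_logprob(h[0], counts),
    )[0]
```

Interpolating the two scores is what enforces vocabulary consistency: a hypothesis whose words have never appeared in the recognized texts is penalized even if the neural model assigns it a high score.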