Rerunning OCR: A Machine Learning Approach to Quality Assessment and Enhancement Prediction

Iterating with new and improved OCR solutions forces decisions about which reprocessing candidates to target. This applies especially when the underlying data collection is large and diverse in terms of fonts, languages, periods of publication and, consequently, OCR quality. This article describes the efforts of the National Library of Luxembourg to support those decisions. They are crucial for keeping computational overhead low and reducing the risk of quality degradation, while making the OCR improvement more quantifiable. In particular, this work explains the library's methodology for quality assessment at the text block level. As an extension of this technique, a second contribution is a regression model that takes the enhancement potential of a new OCR engine into account. Both are promising approaches, especially for cultural institutions dealing with historical data of lower quality.
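To make the two-stage idea concrete, here is a minimal sketch of how reprocessing candidates might be selected: estimate quality per text block, then keep only blocks whose predicted improvement justifies a rerun. Everything in it is an assumption for illustration and not the library's actual method: the `TextBlock` type, the dictionary-hit ratio used as a quality proxy, the thresholds, and the `predict_gain` callable, which merely stands in for a trained regression model.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple


@dataclass
class TextBlock:
    """Hypothetical container for one OCRed text block."""
    block_id: str
    tokens: List[str]


def dictionary_score(block: TextBlock, lexicon: Set[str]) -> float:
    """Fraction of tokens found in the lexicon (a crude quality proxy, assumed here)."""
    if not block.tokens:
        return 0.0
    hits = sum(1 for token in block.tokens if token.lower() in lexicon)
    return hits / len(block.tokens)


def select_candidates(
    blocks: List[TextBlock],
    lexicon: Set[str],
    predict_gain: Callable[[float], float],
    quality_threshold: float = 0.8,  # assumed cut-off, not from the article
    min_gain: float = 0.05,          # assumed minimum predicted improvement
) -> List[Tuple[str, float, float]]:
    """Return (block_id, quality, predicted_gain), highest predicted gain first."""
    selected = []
    for block in blocks:
        quality = dictionary_score(block, lexicon)
        if quality >= quality_threshold:
            continue  # already good enough: skip to save compute and avoid degradation
        gain = predict_gain(quality)
        if gain >= min_gain:
            selected.append((block.block_id, quality, gain))
    return sorted(selected, key=lambda item: item[2], reverse=True)


# Usage with a placeholder gain function standing in for a regression model.
lexicon = {"the", "library", "of", "luxembourg"}
blocks = [
    TextBlock("good", ["the", "library", "of", "luxembourg"]),
    TextBlock("bad", ["tbe", "librarv", "of", "luxembourg"]),
]
candidates = select_candidates(blocks, lexicon, lambda q: 0.5 * (1.0 - q))
print(candidates)
```

Separating the quality estimate from the gain prediction mirrors the abstract's distinction between assessing existing OCR output and predicting what a new engine could add; in practice both functions would be replaced by trained models.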


