Iterating with new and improved OCR solutions forces decisions about which material to target for reprocessing. This applies especially when the underlying data collection is large and diverse in terms of fonts, languages, periods of publication and, consequently, OCR quality. This article describes the efforts of the National Library of Luxembourg to support those decisions. They are crucial for keeping computational overhead low and reducing the risk of quality degradation, while making the OCR improvement more quantifiable. In particular, this work explains the library's methodology for quality assessment at the text block level. As an extension of this technique, a further contribution is a regression model that takes the enhancement potential of a new OCR engine into account. Both are promising approaches, especially for cultural institutions dealing with historical data of lower quality.