Iterating on new and improved OCR solutions forces decisions about which candidates to target for reprocessing. This applies especially when the underlying data collection is large and diverse in fonts, languages, periods of publication and, consequently, OCR quality. This article describes the efforts of the National Library of Luxembourg to support those targeting decisions. They are crucial for keeping computational overhead low and reducing the risk of quality degradation, while making the OCR improvement more quantifiable. In particular, this work explains the library's methodology for text block level quality assessment. As an extension of this technique, it also presents a regression model that takes into account the enhancement potential of a new OCR engine. Both are promising approaches, especially for cultural institutions dealing with historical data of lower quality.