Digitization of historical documents is a challenging task in many digital humanities projects. A popular approach is to scan the documents into images and then convert the images into text using Optical Character Recognition (OCR) algorithms. However, OCR output for historical documents is usually inaccurate and requires post-processing error correction. This study investigates how crowdsourcing can be used to correct OCR errors in historical text collections, and which crowdsourcing methodology is most effective in different scenarios and for various research objectives. A series of experiments with different micro-task structures and text lengths was conducted with 753 workers on Amazon's Mechanical Turk platform. The workers were asked to fix OCR errors in a selected historical text. To analyze the results, new accuracy and efficiency measures were devised. The analysis suggests that, in terms of accuracy, the optimal text length is medium (paragraph-size) and the optimal experiment structure is two-phase with a scanned image. In terms of efficiency, the best results were obtained with longer text in the single-stage structure with no image. The study provides practical recommendations to researchers on how to build an optimal crowdsourcing task for OCR post-correction. The developed methodology can also be used to create gold-standard historical texts for automatic OCR post-correction. This is the first attempt to systematically investigate the influence of various factors on crowdsourcing-based OCR post-correction and to propose an optimal strategy for this process.