There has been recent interest in improving optical character recognition (OCR) for endangered languages, particularly because a large number of documents and books in these languages are not in machine-readable formats. The performance of OCR systems is typically evaluated using automatic metrics such as character and word error rates. While error rates are useful for comparing different models and systems, they do not measure whether and how the transcriptions produced by OCR tools are useful to downstream users. In this paper, we present a human-centric evaluation of OCR systems, focusing on the Kwak'wala language as a case study. Through a user study, we show that using OCR reduces the time spent on the manual transcription of culturally valuable documents -- a task that is often undertaken by endangered language community members and researchers -- by over 50%. Our results demonstrate the potential benefits that OCR tools can have for downstream language documentation and revitalization efforts.