Scholars in the humanities rely heavily on ancient manuscripts to study the history, religion, and socio-political structures of the past. Many efforts have been devoted to digitizing these precious manuscripts with Optical Character Recognition (OCR) technology, but most manuscripts have been blemished over the centuries, so an OCR program cannot be expected to capture faded graphs or stains on the pages. This work presents a neural spelling correction model, built on Google OCR-ed Tibetan manuscripts, that auto-corrects noisy OCR output. The paper is divided into four sections: dataset, model architecture, training, and analysis. First, we feature-engineered our raw Tibetan e-text corpus into two sets of structured data frames: a set of paired toy data and a set of paired real data. Then, we incorporated a Confidence Score mechanism into the Transformer architecture to perform the spelling correction task. Measured by loss and Character Error Rate (CER), our Transformer + Confidence Score architecture proves superior to the vanilla Transformer, LSTM-2-LSTM, and GRU-2-GRU architectures. Finally, to examine the robustness of our model, we analyzed its erroneous tokens and visualized its Attention and Self-Attention heatmaps.
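Since the models above are compared by Character Error Rate (CER), the following is a minimal Python sketch of how CER is conventionally computed: the Levenshtein edit distance from the model's corrected output to the gold transcription, normalized by the reference length. The function names and the Tibetan example string are illustrative assumptions, not taken from the paper's code.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(hyp) + 1))  # DP row for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion from the reference
                curr[j - 1] + 1,         # insertion into the reference
                prev[j - 1] + (r != h),  # substitution (free if characters match)
            ))
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance / number of reference characters."""
    if not reference:
        return float(len(hypothesis) > 0)
    return levenshtein(reference, hypothesis) / len(reference)

# Hypothetical example: the OCR-ed form drops one vowel sign from the reference.
print(cer("བཀྲ་ཤིས་བདེ་ལེགས", "བཀྲ་ཤས་བདེ་ལེགས"))  # 1 edit / 16 characters = 0.0625
```

Note that CER here operates on Unicode code points, so a missing Tibetan vowel sign counts as a single-character error even though it is visually part of a larger syllable.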