Most low-resource languages lack the resources needed to build even a sizable monolingual corpus. Text in these languages often appears in government proceedings, but mainly as Portable Document Format (PDF) files that use legacy fonts. Extracting text from such documents to build a monolingual corpus is challenging because legacy fonts and printer-friendly encodings are not optimized for text extraction. We therefore propose a simple, automatic, and novel approach that scales to Tamil, Sinhala, and English, to large numbers of documents, and to parallel corpora. Since Tamil and Sinhala are low-resource languages, we improved the performance of Tesseract by employing LSTM-based training on more than 20 legacy fonts to recognize printed characters in these languages. In particular, our model detects code-mixed text, numbers, and special characters in printed documents. On the test set, this approach reduces the character-level error rate of Tesseract from 6.03 to 2.61 for Tamil (an absolute reduction of 3.42) and from 7.61 to 4.74 for Sinhala (an absolute reduction of 2.87), and the word-level error rate from 39.68 to 20.61 for Tamil (an absolute reduction of 19.07) and from 35.04 to 26.58 for Sinhala (an absolute reduction of 8.46). Our newly created parallel corpus consists of 185.4k, 168.9k, and 181.04k sentences and 2.11M, 2.22M, and 2.33M words in Tamil, Sinhala, and English, respectively. This study shows that fine-tuning Tesseract models on multiple new fonts helps the models recognize text and enhances OCR performance. We have made the newly trained models and the source code for fine-tuning Tesseract freely available.
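For readers unfamiliar with the metrics, the following is a minimal sketch of how character-level and word-level error rates are commonly computed, assuming the usual definition (Levenshtein edit distance divided by reference length, expressed as a percentage). The function names are illustrative and are not taken from the released source code; the evaluation actually used in this work may differ, for example in text normalization. Note also that a Python string iterates over Unicode code points, so for Tamil and Sinhala this sketch counts code points rather than user-perceived characters.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance as a percentage of the reference length."""
    if len(ref_units) == 0:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_units, hyp_units) / len(ref_units)


def cer(reference, hypothesis):
    """Character-level error rate (assumption: units are code points)."""
    return error_rate(reference, hypothesis)


def wer(reference, hypothesis):
    """Word-level error rate (assumption: whitespace tokenization)."""
    return error_rate(reference.split(), hypothesis.split())


def absolute_change(before, after):
    """Difference between two error rates, in the same units."""
    return after - before


def relative_change(before, after):
    """Difference as a percentage of the starting value."""
    return 100.0 * (after - before) / before
```

With these helpers, `absolute_change(6.03, 2.61)` gives the 3.42 drop in the Tamil character-level error rate, whereas `relative_change(6.03, 2.61)` expresses the same drop as a fraction of the starting value.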