We have built SinSpell, a comprehensive spelling checker for Sinhala, a language spoken by over 16 million people, mainly in Sri Lanka. Until recently, Sinhala had no spelling checker with acceptable coverage, and SinSpell is still the only open-source Sinhala spelling checker. SinSpell identifies possible spelling errors and suggests corrections. It also contains a module that auto-corrects evident errors. To maintain accuracy, SinSpell was designed as a rule-based system built on Hunspell. A set of words was compiled from several sources and verified. These words were divided into morphological classes, and the valid roots, suffixes and prefixes for each class were identified, together with lists of irregular words and exceptions. The errors in a corpus of Sinhala documents were analysed to identify commonly misspelled words and common error types. We found that the most common errors involved vowel length and similar-sounding letters. Errors due to incorrect typing and encoding were also found. This analysis was used to develop the suggestion generator and auto-corrector.
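To illustrate how an error analysis of this kind could feed a suggestion generator, the following is a minimal sketch in plain Python. It is not SinSpell's implementation and it is not Hunspell configuration syntax. The confusion pairs (short versus long vowel signs, dental versus retroflex consonants), the function names and the one-entry word list are all assumptions chosen for illustration; a real system would derive its pairs from the corpus analysis and use the verified lexicon.

```python
# Hypothetical confusion pairs, written as Unicode escapes for the Sinhala block.
# They are illustrative assumptions, not the pairs used by SinSpell.
CONFUSABLE_PAIRS = [
    ("\u0dd2", "\u0dd3"),  # short / long i vowel sign
    ("\u0dd4", "\u0dd6"),  # short / long u vowel sign
    ("\u0dd9", "\u0dda"),  # short / long e vowel sign
    ("\u0db1", "\u0dab"),  # dental / retroflex "na"
    ("\u0dbd", "\u0dc5"),  # dental / retroflex "la"
]


def build_swap_table(pairs):
    """Map each character to the characters it is commonly confused with."""
    table = {}
    for a, b in pairs:
        table.setdefault(a, []).append(b)
        table.setdefault(b, []).append(a)
    return table


SWAPS = build_swap_table(CONFUSABLE_PAIRS)


def suggest(word, lexicon):
    """Return lexicon entries reachable from `word` by one confusable swap.

    A word already in the lexicon is treated as correct and gets no suggestions.
    """
    if word in lexicon:
        return []
    found = []
    for i, ch in enumerate(word):
        for alt in SWAPS.get(ch, []):
            candidate = word[:i] + alt + word[i + 1:]
            if candidate in lexicon and candidate not in found:
                found.append(candidate)
    return found


# Toy lexicon with a single placeholder entry (an assumption, not vetted data).
toy_lexicon = {"\u0d9c\u0dab\u0db1"}
# Input differs from the entry only in dental vs retroflex "na" at position 1.
suggestions = suggest("\u0d9c\u0db1\u0db1", toy_lexicon)
```

The sketch only tries single-character substitutions and filters candidates against the word list, so it never proposes a form that the lexicon does not contain. Handling typing and encoding errors, or ranking multiple candidates, would need additional rules that are not shown here.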