Spelling error correction is the task of identifying and rectifying misspelled words in text. It is an active research topic in Natural Language Processing because of its numerous applications in human language understanding. Characters that are phonetically or visually similar yet semantically distinct make it an arduous task in any language. Earlier efforts on spelling error correction in Bangla and other resource-scarce Indic languages focused on rule-based, statistical, and machine learning-based methods, which we found rather inefficient. In particular, machine learning-based approaches, although they outperform rule-based and statistical methods, are ineffective because they attempt to correct every character regardless of whether correction is appropriate. In this work, we propose a novel detector-purificator-corrector framework based on denoising transformers that addresses these issues. Moreover, we present a method for creating a large-scale corpus from scratch, which in turn resolves the resource limitation problem for any left-to-right scripted language. Empirical results demonstrate the effectiveness of our approach, which outperforms previous state-of-the-art methods by a significant margin for Bangla spelling error correction. The models and corpus are publicly available at https://tinyurl.com/DPCSpell.