Chinese Spelling Correction (CSC) is the task of detecting and correcting spelling mistakes in text. Because most Chinese text is typed with a pinyin input method editor (IME), studying the spelling errors that arise in this process is especially practical and valuable. However, no research has yet been dedicated to this essential scenario. In this paper, we first present a Chinese Spelling Correction Dataset for errors generated by pinyin IME (CSCD-IME), which contains 40,000 annotated sentences from real posts by official media accounts on Sina Weibo. Furthermore, we propose a novel method that automatically constructs large-scale, high-quality pseudo data by simulating input through a pinyin IME. A series of analyses and experiments on CSCD-IME shows that spelling errors produced by pinyin IMEs follow a particular distribution at both the pinyin level and the semantic level, and that they are sufficiently challenging. Meanwhile, our proposed pseudo-data construction method can better fit this error distribution and improve the performance of CSC systems. Finally, we provide a practical guide to using pseudo data, covering the data scale, the data source, and the training strategy.
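To make the idea of simulating pinyin IME input concrete, the following is a minimal sketch, not the method proposed in the paper, whose details the abstract does not give. It assumes a sentence is already segmented into words and that an IME can be approximated by a lookup from a pinyin string to candidate words. The lexicon `CANDIDATES` and the function `simulate_ime_error` are hypothetical names introduced only for illustration.

```python
import random

# Hypothetical toy lexicon: pinyin string -> words an IME might offer for it.
# A real pipeline would need a full IME or a pinyin-to-word model instead.
CANDIDATES = {
    "zaijian": ["再见", "再建"],
    "yijing": ["已经", "意境"],
    "quanli": ["权利", "权力", "全力"],
}

# Reverse index: word -> its pinyin string.
WORD_TO_PINYIN = {w: py for py, ws in CANDIDATES.items() for w in ws}


def simulate_ime_error(words, rng):
    """Replace one word with a different candidate that shares its pinyin.

    Returns (noisy_words, changed_index). If no word in the sentence is
    covered by the lexicon, returns a copy of the input and None.
    """
    positions = [i for i, w in enumerate(words) if w in WORD_TO_PINYIN]
    if not positions:
        return list(words), None
    idx = rng.choice(positions)
    original = words[idx]
    # Candidates the simulated IME could have picked by mistake.
    alternatives = [c for c in CANDIDATES[WORD_TO_PINYIN[original]] if c != original]
    noisy = list(words)
    noisy[idx] = rng.choice(alternatives)
    return noisy, idx


if __name__ == "__main__":
    rng = random.Random(0)
    correct = ["我们", "已经", "到", "了"]
    noisy, idx = simulate_ime_error(correct, rng)
    # The (noisy, correct) pair can serve as one pseudo training example.
    print("".join(noisy), "->", "".join(correct))
```

Because the replacement shares the original word's pinyin and has the same number of characters, the noisy and correct sentences stay aligned character by character, which is the input format most CSC models expect.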