General and Domain Adaptive Chinese Spelling Check with Error Consistent Pretraining

The lack of labeled data is one of the significant bottlenecks for Chinese Spelling Check (CSC). Existing research expands the supervised corpus by automatically generating training examples from unlabeled data. However, there is a large gap between real input scenarios and automatically generated corpora. We therefore develop a competitive general speller, ECSpell, which adopts an Error Consistent masking strategy to create data for pretraining. This strategy specifies the error types of the automatically generated sentences so that they are consistent with real-world errors. Experimental results indicate that our model outperforms previous state-of-the-art models on the general benchmark. Moreover, in practice spellers often work within a particular domain. Experiments on the domain-specific datasets we built show that general models perform poorly there because of the large number of uncommon domain terms. Inspired by the common practice of input methods, we propose adding an alterable user dictionary to handle the zero-shot domain adaptation problem. Specifically, we attach a User Dictionary guided inference module (UD) to a general token-classification-based speller. Our experiments demonstrate that ECSpell$^{UD}$, namely ECSpell combined with UD, surpasses all other baselines by a large margin on the domain-specific datasets, even approaching its performance on the general benchmark.
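The idea of attaching a user dictionary to a token-classification speller can be illustrated with a minimal sketch. This is not the paper's UD module: the function name `ud_guided_decode`, the per-position candidate log-probabilities, and the fixed `bonus` added for each matched dictionary term are all assumptions made for illustration. The sketch uses dynamic programming to choose, for each span, between the model's top character and a dictionary term whose characters all appear among the model's candidates.

```python
import math
from typing import Dict, List, Set


def ud_guided_decode(
    candidates: List[Dict[str, float]],
    user_dict: Set[str],
    bonus: float = 2.0,
) -> str:
    """Decode one character per position, rewarding user-dictionary terms.

    candidates[i] maps each candidate character at position i to a
    log-probability from a (hypothetical) token-classification speller.
    `bonus` is an illustrative reward per matched term, not a value
    taken from the paper.
    """
    n = len(candidates)
    best_score = [0.0] + [-math.inf] * n  # best score for each prefix length
    best_piece = [""] * (n + 1)           # last piece chosen for that prefix

    for end in range(1, n + 1):
        # Option 1: extend with the single most probable character.
        char, logp = max(candidates[end - 1].items(), key=lambda kv: kv[1])
        best_score[end] = best_score[end - 1] + logp
        best_piece[end] = char

        # Option 2: end the prefix with a dictionary term, if every one of
        # its characters is among the model's candidates at that position.
        for term in user_dict:
            start = end - len(term)
            if start < 0:
                continue
            if all(term[k] in candidates[start + k] for k in range(len(term))):
                term_logp = sum(candidates[start + k][term[k]] for k in range(len(term)))
                score = best_score[start] + term_logp + bonus
                if score > best_score[end]:
                    best_score[end] = score
                    best_piece[end] = term

    # Backtrack through the chosen pieces to rebuild the output sentence.
    pieces = []
    pos = n
    while pos > 0:
        pieces.append(best_piece[pos])
        pos -= len(best_piece[pos])
    return "".join(reversed(pieces))


# Toy input "高雪压": the model slightly prefers the wrong character at position 1.
cands = [
    {"高": -0.1},
    {"雪": -0.3, "血": -1.5},
    {"压": -0.2, "鸭": -3.0},
]
without_dict = ud_guided_decode(cands, set())
with_dict = ud_guided_decode(cands, {"高血压"})
```

In this toy case the dictionary entry "高血压" (hypertension) overrides the model's top choice at the middle position, while decoding without a dictionary returns the model's greedy output. Because the dictionary is only consulted at inference time, it can be swapped per domain without retraining the speller, which is the property the abstract relies on for zero-shot adaptation.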
