Synthetic data construction for Grammatical Error Correction (GEC) in non-English languages relies heavily on human-designed, language-specific rules, which produce only a limited range of error-correction patterns. In this paper, we propose a generic, language-independent strategy for multilingual GEC that can effectively train a GEC system for a new non-English language using only two easily accessible resources: 1) a pretrained cross-lingual language model (PXLM) and 2) parallel translation data between English and that language. Our approach creates diverse parallel GEC data without any language-specific operations by treating the non-autoregressive translation generated by the PXLM as the erroneous source and the gold translation as its correction. We then reuse the PXLM to initialize the GEC model and pretrain it on the synthetic data that the PXLM itself generated, which yields further improvement. We evaluate our approach on three public GEC benchmarks in different languages. It achieves state-of-the-art results on the NLPCC 2018 Task 2 dataset (Chinese) and competitive performance on Falko-Merlin (German) and RULEC-GEC (Russian). Further analysis demonstrates that our data construction method is complementary to rule-based approaches.
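
The data construction step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `build_gec_pairs`, `toy_translate`, and the `drop_identical` filter are hypothetical names and choices introduced here, and the toy lookup table merely stands in for a real PXLM that would translate the English side non-autoregressively.

```python
from typing import Callable, Iterable, List, Tuple


def build_gec_pairs(
    parallel: Iterable[Tuple[str, str]],
    translate: Callable[[str], str],
    drop_identical: bool = True,
) -> List[Tuple[str, str]]:
    """Turn (English, gold target) translation pairs into (erroneous, corrected) pairs.

    `translate` is assumed to wrap a PXLM producing a non-autoregressive
    translation of the English sentence into the target language.
    """
    pairs = []
    for english, gold in parallel:
        hypothesis = translate(english).strip()
        gold = gold.strip()
        if not hypothesis:
            continue  # the model produced nothing usable
        # Assumption (not from the paper): skip outputs that already match
        # the gold translation, since they contain no error to correct.
        if drop_identical and hypothesis == gold:
            continue
        # The imperfect machine output is the GEC source; the gold is the target.
        pairs.append((hypothesis, gold))
    return pairs


# Hypothetical stand-in for the PXLM: a fixed lookup of imperfect outputs.
_TOY_OUTPUT = {
    "I went to school yesterday.": "Ich gehe gestern zur Schule.",
    "She has two cats.": "Sie hat zwei Katzen.",
}


def toy_translate(english: str) -> str:
    return _TOY_OUTPUT.get(english, "")


parallel_data = [
    ("I went to school yesterday.", "Ich ging gestern zur Schule."),
    ("She has two cats.", "Sie hat zwei Katzen."),
]

gec_pairs = build_gec_pairs(parallel_data, toy_translate)
for source, target in gec_pairs:
    print(f"{source}\t{target}")
```

Because the translator is passed in as a callable, the same loop would work with any model wrapper. The resulting pairs could then serve as pretraining data for a GEC model initialized from the same PXLM.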