Chinese Grammatical Error Correction (CGEC) is both a challenging NLP task and a common application in daily life. Recently, many data-driven approaches have been proposed to advance CGEC research. However, the field faces two major limitations. First, the lack of high-quality annotated training corpora prevents existing CGEC models from improving significantly. Second, the grammatical errors in widely used test sets were not made by native Chinese speakers, leaving a significant gap between CGEC models and real-world applications. In this paper, we propose a linguistic-rules-based approach for constructing large-scale CGEC training corpora with automatically generated grammatical errors. In addition, we present a challenging CGEC benchmark derived entirely from errors made by native Chinese speakers in real-world scenarios. Extensive experiments and detailed analyses demonstrate not only that training data constructed by our method effectively improves the performance of CGEC models, but also that our benchmark is an excellent resource for further development of the CGEC field.
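To illustrate the general idea of rule-based error generation described above, the following is a minimal sketch, not the paper's actual method: given a grammatically correct, pre-tokenized sentence, a randomly chosen corruption rule (redundancy, omission, or word-order) produces an erroneous counterpart, yielding an (erroneous, correct) training pair. All rule names and error categories here are illustrative assumptions.

```python
# Hypothetical sketch of rule-based grammatical-error generation for
# building CGEC training pairs. The specific rules below are
# illustrative assumptions, not the paper's actual rule inventory.
import random


def insert_redundant_word(tokens, idx):
    """Redundancy error: duplicate the token at idx."""
    return tokens[:idx + 1] + [tokens[idx]] + tokens[idx + 1:]


def delete_word(tokens, idx):
    """Omission error: drop the token at idx."""
    return tokens[:idx] + tokens[idx + 1:]


def swap_adjacent(tokens, idx):
    """Word-order error: swap the token at idx with its right neighbor."""
    if idx + 1 >= len(tokens):
        return list(tokens)
    out = list(tokens)
    out[idx], out[idx + 1] = out[idx + 1], out[idx]
    return out


RULES = [insert_redundant_word, delete_word, swap_adjacent]


def corrupt(sentence, seed=0):
    """Apply one randomly chosen rule to a space-tokenized sentence.

    Returns an (erroneous_sentence, correct_sentence) pair that could
    serve as one synthetic training example.
    """
    rng = random.Random(seed)
    tokens = sentence.split()
    rule = rng.choice(RULES)
    idx = rng.randrange(len(tokens))
    return " ".join(rule(tokens, idx)), sentence
```

In practice, such rules would be driven by linguistic analysis (e.g. POS tags or dependency structure) rather than uniform random positions, so that the injected errors resemble those native speakers actually make.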