Synthetic data generation is widely known to boost the accuracy of neural grammatical error correction (GEC) systems, but existing methods often lack diversity or are too simplistic to generate the broad range of grammatical errors made by human writers. In this work, we use error type tags from automatic annotation tools such as ERRANT to guide synthetic data generation. We compare several models that can produce an ungrammatical sentence given a clean sentence and an error type tag. We use these models to build a new, large synthetic pre-training data set with error tag frequency distributions matching a given development set. Our synthetic data set yields large and consistent gains, improving the state-of-the-art on the BEA-19 and CoNLL-14 test sets. We also show that our approach is particularly effective in adapting a GEC system, trained on mixed native and non-native English, to a native English test set, even surpassing real training data consisting of high-quality sentence pairs.
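To make the tag-guided generation idea concrete, the sketch below shows one way such a pipeline could be wired up: estimate an ERRANT error-type distribution from a development set, sample one tag per clean sentence from that distribution, and hand each (tag, clean sentence) pair to a corruption model. This is a minimal sketch under stated assumptions, not the paper's implementation: it assumes the `errant` Python package's `load`/`parse`/`annotate` API, and the `corrupt()` function is a hypothetical placeholder standing in for the tagged corruption model; the tag-prefix output format is likewise an illustrative assumption.

```python
# Minimal sketch of tag-guided synthetic data generation (illustrative only).
# Assumes the `errant` package (and its spaCy English model) is installed;
# `corrupt()` is a hypothetical stand-in for a trained tagged corruption model.
import random
from collections import Counter

import errant

annotator = errant.load("en")


def tag_distribution(dev_pairs):
    """Estimate ERRANT error-type frequencies from (source, correction) pairs."""
    counts = Counter()
    for src, cor in dev_pairs:
        edits = annotator.annotate(annotator.parse(src), annotator.parse(cor))
        counts.update(e.type for e in edits)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


def corrupt(tag, clean_sentence):
    """Hypothetical placeholder: in the paper this role is played by a model
    that rewrites `clean_sentence` to contain an error of type `tag`."""
    return f"<{tag}> {clean_sentence}"  # placeholder output, not a real corruption


def generate_synthetic(clean_sentences, dist, seed=0):
    """Sample one error tag per clean sentence according to `dist` and
    return (ungrammatical, clean) training pairs."""
    rng = random.Random(seed)
    tags, weights = zip(*dist.items())
    pairs = []
    for sent in clean_sentences:
        tag = rng.choices(tags, weights=weights, k=1)[0]
        pairs.append((corrupt(tag, sent), sent))
    return pairs


if __name__ == "__main__":
    dev = [("He go to school yesterday.", "He went to school yesterday.")]
    clean = ["The results are shown below.", "We train the model for ten epochs."]
    dist = tag_distribution(dev)
    for noisy, ref in generate_synthetic(clean, dist):
        print(noisy, "->", ref)
```

Because the tags are sampled from the development-set distribution, the resulting synthetic corpus mirrors the error-type frequencies of the target domain, which is the mechanism the abstract credits for the domain-adaptation gains.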