Interest has grown in data generation approaches to grammatical error correction (GEC) that use pseudo data. However, these approaches suffer from several issues that make them inconvenient for real-world deployment, including a demand for large amounts of training data. On the other hand, errors that are based on grammatical rules may not require large amounts of data if GEC models can realize grammatical generalization. This study explores to what extent GEC models generalize the grammatical knowledge required for correcting errors. We introduce an analysis method that uses synthetic and real GEC datasets with controlled vocabularies to evaluate whether models can generalize to unseen errors. We found that a current standard Transformer-based GEC model fails to realize grammatical generalization even in simple settings with limited vocabulary and syntax, suggesting that it lacks the generalization ability required to correct errors from the provided training examples.
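As a rough illustration of what a controlled-vocabulary evaluation of this kind could look like, the sketch below builds a toy synthetic dataset for subject-verb agreement errors. It is not the study's actual data construction: the vocabulary, the helper names `make_pair` and `build_split`, and the choice of error type are all assumptions made for this example. The idea it shows is that a held-out verb appears only in correct sentences during training and only inside an error at test time, so a model can correct the test items only by generalizing the agreement rule.

```python
import random

# Hypothetical toy vocabulary (singular -> plural forms); not taken from the study.
NOUNS = {"dog": "dogs", "cat": "cats", "bird": "birds", "girl": "girls"}
# Third-person singular verb form -> plural-agreeing form.
VERBS = {"runs": "run", "sleeps": "sleep", "sings": "sing", "jumps": "jump"}


def make_pair(noun_sg, verb_sg, plural, with_error):
    """Return a (source, target) pair for one subject-verb agreement example."""
    subject = NOUNS[noun_sg] if plural else noun_sg
    good_verb = VERBS[verb_sg] if plural else verb_sg
    bad_verb = verb_sg if plural else VERBS[verb_sg]
    target = f"the {subject} {good_verb}"
    # An error-free example uses the target sentence as its own source.
    source = f"the {subject} {bad_verb}" if with_error else target
    return source, target


def build_split(held_out_verbs, seed=0):
    """Build train/test sets in which held-out verbs are never erroneous in training."""
    rng = random.Random(seed)
    train, test = [], []
    for noun in NOUNS:
        for verb in VERBS:
            for plural in (False, True):
                if verb in held_out_verbs:
                    # Seen only in correct sentences during training ...
                    train.append(make_pair(noun, verb, plural, with_error=False))
                    # ... and only inside an agreement error at test time.
                    test.append(make_pair(noun, verb, plural, with_error=True))
                else:
                    train.append(make_pair(noun, verb, plural, with_error=True))
    rng.shuffle(train)
    return train, test


train_pairs, test_pairs = build_split(held_out_verbs={"sings"})
```

Under these assumptions, exact-match accuracy on `test_pairs` would measure generalization to unseen errors, while accuracy on erroneous training-style pairs would measure what the model learned from direct supervision.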