Research on Korean grammatical error correction (GEC) is limited compared to other major languages such as English and Chinese. We attribute this gap to the lack of a carefully designed evaluation benchmark for Korean. In this work, we first collect three datasets from different sources (Kor-Lang8, Kor-Native, and Kor-Learner) that cover a wide range of error types, and annotate them using our newly proposed tool, the Korean Automatic Grammatical error Annotation System (KAGAS). KAGAS is an edit alignment and classification tool that takes the characteristics of Korean into account when aligning a source sentence with a target sentence, and that identifies the error type of each aligned edit. We also present baseline models fine-tuned on our datasets. We show that a model trained on our datasets significantly outperforms the public statistical GEC system (Hanspell) on a wider range of error types, demonstrating the diversity and usefulness of the datasets.
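To make the notion of an "aligned edit" concrete, the following is a minimal sketch of source-to-target edit extraction. It is not KAGAS: the function name `align_edits` is hypothetical, alignment is done on whitespace-separated tokens with Python's standard `difflib`, and edits are labeled only with a generic operation (replace, insert, delete) rather than Korean-specific error types.

```python
from difflib import SequenceMatcher


def align_edits(source: str, target: str):
    """Return the token-level edits that turn `source` into `target`.

    Illustrative assumption: tokens are whitespace-separated words.
    A Korean-aware aligner would likely work on morphemes instead.
    """
    src = source.split()
    tgt = target.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)

    edits = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            continue  # unchanged tokens are not edits
        edits.append({
            "op": op,                       # "replace", "insert", or "delete"
            "src_span": (i1, i2),           # token offsets in the source
            "src": " ".join(src[i1:i2]),    # original tokens ("" for insert)
            "tgt": " ".join(tgt[j1:j2]),    # corrected tokens ("" for delete)
        })
    return edits


if __name__ == "__main__":
    # A misspelled verb form in the source, corrected in the target.
    for edit in align_edits("나는 학교에 갓어요", "나는 학교에 갔어요"):
        print(edit)
```

Each returned edit pairs a source span with its correction, which is the unit that an error-type classifier would then label.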