Currently available grammatical error correction (GEC) datasets are compiled from well-formed written text, which limits their applicability to other domains such as informal writing and dialog. In this paper, we present a novel parallel GEC dataset drawn from open-domain chatbot conversations; to our knowledge, this is the first GEC dataset targeted at a conversational setting. To demonstrate the utility of the dataset, we use our annotated data to fine-tune a state-of-the-art GEC model, obtaining a 16-point increase in model precision. This is particularly important for GEC, where precision is generally valued over recall because false positives can seriously confuse language learners. We also present a detailed annotation scheme that ranks errors by their perceived impact on comprehensibility, making our dataset both reproducible and extensible. Experimental results demonstrate the effectiveness of our data in improving GEC model performance in conversational scenarios.