Pretraining-based (PT-based) automatic evaluation metrics (e.g., BERTScore and BARTScore) have been widely used in several sentence generation tasks (e.g., machine translation and text summarization) due to their better correlation with human judgments over traditional overlap-based methods. Although PT-based methods have become the de facto standard for training grammatical error correction (GEC) systems, GEC evaluation still does not benefit from pretrained knowledge. This paper takes the first step towards understanding and improving GEC evaluation with pretraining. We first find that arbitrarily applying PT-based metrics to GEC evaluation brings unsatisfactory correlation results because of the excessive attention to inessential system outputs (e.g., unchanged parts). To alleviate this limitation, we propose a novel GEC evaluation metric that achieves the best of both worlds, namely PT-M2, which uses PT-based metrics to score only the corrected parts. Experimental results on the CoNLL14 evaluation task show that PT-M2 significantly outperforms existing methods, achieving a new state-of-the-art result of 0.949 Pearson correlation. Further analysis reveals that PT-M2 is robust when evaluating competitive GEC systems. Source code and scripts are freely available at https://github.com/pygongnlp/PT-M2.
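As a hedged illustration of the core idea above (scoring only the corrected spans rather than the whole sentence), the sketch below extracts edited spans with `difflib` and scores them with a toy exact-match similarity. Both `corrected_spans` and `toy_pt_score` are hypothetical stand-ins: the actual PT-M2 metric uses M2-style annotated edits and a PT-based metric such as BERTScore, not the simplifications shown here.

```python
import difflib

def corrected_spans(source, hypothesis):
    """Return (source_span, hypothesis_span) pairs the system changed.

    difflib is a stand-in for the M2-style edit extraction used by the
    real metric; unchanged parts are deliberately ignored.
    """
    src, hyp = source.split(), hypothesis.split()
    sm = difflib.SequenceMatcher(a=src, b=hyp)
    return [(" ".join(src[i1:i2]), " ".join(hyp[j1:j2]))
            for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal"]

def toy_pt_score(edit, reference_edits):
    """Toy stand-in for a PT-based similarity (e.g., BERTScore):
    1.0 if the hypothesis edit exactly matches a reference edit."""
    return 1.0 if edit in reference_edits else 0.0

def pt_m2_sketch(source, hypothesis, reference):
    """Score only the corrected parts against the reference edits."""
    hyp_edits = corrected_spans(source, hypothesis)
    ref_edits = corrected_spans(source, reference)
    if not ref_edits:                     # reference leaves source unchanged
        return 1.0 if not hyp_edits else 0.0
    if not hyp_edits:                     # system corrected nothing
        return 0.0
    return sum(toy_pt_score(e, ref_edits) for e in hyp_edits) / len(hyp_edits)
```

For example, `pt_m2_sketch("He go to school", "He goes to school", "He goes to school")` credits the single edit ("go" → "goes") and returns 1.0, while the large unchanged portion of the sentence contributes nothing to the score.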