Grammatical Error Correction (GEC) relies heavily on the availability of large amounts of high-quality synthetic parallel data, consisting of pairs of grammatically correct and erroneous sentences. The quality of synthetic data is usually evaluated by how well a GEC system performs when pre-trained on it. This, however, gives little insight into which factors determine the quality of the data. This work therefore introduces three metrics (reliability, diversity, and distribution match) that provide more insight into the quality of large-scale synthetic data generated for the GEC task and that can be evaluated automatically. Evaluating these three metrics automatically can also provide feedback to data generation systems and thereby dynamically improve the quality of the synthetic data they generate.