Standard automatic metrics (such as BLEU) are inadequate for document-level MT evaluation: they can neither distinguish document-level improvements in translation quality from sentence-level ones, nor identify the specific discourse phenomena that cause translation errors. To address these problems, we propose BlonD, an automatic metric for document-level machine translation evaluation. BlonD takes discourse coherence into account by calculating the recall and distance of check-pointing phrases and tags, and combines these with n-gram scores to provide a comprehensive evaluation. We conduct extensive comparisons between BlonD and existing evaluation metrics to illustrate their critical distinctions. Experimental results show that BlonD is considerably more sensitive to document-level quality than previous metrics. Human evaluation further reveals high Pearson correlations between BlonD scores and manual quality judgments.
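To make the idea concrete, the sketch below (not the authors' implementation) shows one way checkpoint-phrase recall over a hypothesis document could be combined with a simple n-gram overlap score into a single document-level number. The checkpoint set, the weight `alpha`, and the combination formula are illustrative assumptions, not the actual BlonD definition.

```python
# Minimal sketch, assuming checkpoints are discourse-sensitive phrases
# (e.g. pronouns, entities, tense markers) extracted from the reference.
from collections import Counter


def ngram_overlap(hyp_tokens, ref_tokens, n=2):
    """Fraction of hypothesis n-grams that also occur in the reference."""
    hyp_ngrams = Counter(tuple(hyp_tokens[i:i + n]) for i in range(len(hyp_tokens) - n + 1))
    ref_ngrams = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    if not hyp_ngrams:
        return 0.0
    matched = sum(min(count, ref_ngrams[g]) for g, count in hyp_ngrams.items())
    return matched / sum(hyp_ngrams.values())


def checkpoint_recall(hyp_text, checkpoints):
    """Fraction of reference checkpoint phrases preserved in the hypothesis."""
    if not checkpoints:
        return 1.0
    found = sum(1 for phrase in checkpoints if phrase in hyp_text)
    return found / len(checkpoints)


def document_score(hyp_text, ref_text, checkpoints, alpha=0.5):
    """Illustrative combination: weighted mix of checkpoint recall and n-gram overlap."""
    recall = checkpoint_recall(hyp_text, checkpoints)
    ngram = ngram_overlap(hyp_text.split(), ref_text.split())
    return alpha * recall + (1 - alpha) * ngram


if __name__ == "__main__":
    ref = "She opened the door . She was late ."
    hyp = "He opened the door . He was late ."  # pronoun error hurts checkpoint recall
    print(document_score(hyp, ref, checkpoints=["She", "late"]))
```

A sentence-level n-gram metric alone would score the two documents as nearly identical; the checkpoint-recall term is what lets a document-level metric penalize the inconsistent pronoun choice.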