While machine translation evaluation metrics based on string overlap (e.g., BLEU) have their limitations, their computations are transparent: the BLEU score assigned to a particular candidate translation can be traced back to the presence or absence of certain words. The operations of newer learned metrics (e.g., BLEURT, COMET), which leverage pretrained language models to achieve higher correlations with human quality judgments than BLEU, are opaque in comparison. In this paper, we shed light on the behavior of these learned metrics by creating DEMETR, a diagnostic dataset with 31K English examples (translated from 10 source languages) for evaluating the sensitivity of MT evaluation metrics to 35 different linguistic perturbations spanning semantic, syntactic, and morphological error categories. All perturbations were carefully designed to form minimal pairs with the actual translation (i.e., differ in only one aspect). We find that learned metrics perform substantially better than string-based metrics on DEMETR. Additionally, learned metrics differ in their sensitivity to various phenomena (e.g., BERTScore is sensitive to untranslated words but relatively insensitive to gender manipulation, while COMET is much more sensitive to word repetition than to aspectual changes). We publicly release DEMETR to spur more informed future development of machine translation evaluation metrics.
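To make the sensitivity analysis concrete, below is a minimal sketch (not code from the paper) of how a metric's accuracy on DEMETR-style minimal pairs could be computed: a metric is credited on a pair when it scores the actual translation above its minimally perturbed counterpart. The `metric_fn` callable and the toy unigram-overlap scorer are illustrative stand-ins for BLEU, BERTScore, BLEURT, or COMET; the names and data here are hypothetical.

```python
# Hedged sketch: scoring a metric's sensitivity on minimal pairs
# (reference, actual translation, perturbed translation). Not the paper's code.
from typing import Callable, List, Tuple


def sensitivity(
    metric_fn: Callable[[str, str], float],
    minimal_pairs: List[Tuple[str, str, str]],
) -> float:
    """Fraction of pairs where the metric scores the actual translation
    above the perturbed one, which differs in exactly one aspect."""
    hits = 0
    for reference, actual, perturbed in minimal_pairs:
        if metric_fn(actual, reference) > metric_fn(perturbed, reference):
            hits += 1
    return hits / len(minimal_pairs)


if __name__ == "__main__":
    # Toy metric: unigram overlap with the reference (a crude BLEU-like stand-in).
    def unigram_overlap(candidate: str, reference: str) -> float:
        cand = set(candidate.lower().split())
        ref = set(reference.lower().split())
        return len(cand & ref) / max(len(cand), 1)

    pairs = [
        # Hypothetical pair illustrating an "untranslated word" perturbation.
        ("the cat is sleeping on the sofa",
         "the cat is sleeping on the sofa",
         "the Katze is sleeping on the sofa"),
    ]
    print(f"sensitivity: {sensitivity(unigram_overlap, pairs):.2f}")
```

In practice, a learned metric such as COMET would also take the source sentence as input; the two-argument `metric_fn` signature above is a simplification for reference-based scoring.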