State-of-the-art machine translation evaluation metrics are based on black-box language models. Hence, recent works have studied their explainability, aiming at better human understanding and better metric analysis, including failure cases. In contrast, we explicitly leverage explanations to boost the metrics' performance. In particular, we treat explanations as word-level scores, which we convert, via power means, into sentence-level scores. We combine these sentence-level scores with the original metric to obtain a better metric. Our extensive evaluation and analysis across 5 datasets, 5 metrics and 4 explainability techniques shows that some configurations reliably improve the original metrics' correlation with human judgment. On two held-out test datasets, we obtain improvements in 15/18 and 4/4 cases, respectively. The gains in Pearson correlation are up to 0.032 and 0.055, respectively. We make our code available.
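The aggregation step described above can be sketched as follows. This is a minimal illustration of the power (generalized) mean and of one plausible way to combine it with an original metric score; the function names, the weighting scheme, and the parameter values are assumptions for illustration, not the paper's exact formulation.

```python
import math

def power_mean(scores, p):
    """Power (generalized) mean of non-negative word-level scores.

    p = 1 gives the arithmetic mean; p -> +inf approaches the max and
    p -> -inf the min. Assumes scores >= 0 when p is fractional.
    """
    if math.isinf(p):
        return max(scores) if p > 0 else min(scores)
    return (sum(s ** p for s in scores) / len(scores)) ** (1.0 / p)

def combined_score(metric_score, word_scores, p=2.0, w=0.5):
    # Hypothetical combination: a weighted average of the original
    # sentence-level metric and the power-mean of the explanation's
    # word-level scores. The actual combination may differ.
    return w * metric_score + (1 - w) * power_mean(word_scores, p)
```

Varying `p` interpolates between focusing on the worst-scored word (p -> -inf) and the best-scored word (p -> +inf), which is why a family of power means is searched rather than a single fixed mean.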