Trainable evaluation metrics for machine translation (MT) exhibit strong correlation with human judgements, but they are often hard to interpret and may produce unreliable scores under noisy or out-of-domain data. Recent work has attempted to mitigate this with simple uncertainty quantification techniques (Monte Carlo dropout and deep ensembles); however, as we show, these techniques are limited in several ways: they are unable to distinguish between different kinds of uncertainty, and they are costly in both time and memory. In this paper, we propose more powerful and efficient uncertainty predictors for MT evaluation, and we assess their ability to target different sources of aleatoric and epistemic uncertainty. To this end, we develop and compare training objectives that enhance the COMET metric with an uncertainty prediction output, including heteroscedastic regression, divergence minimization, and direct uncertainty prediction. Our experiments show improved results on uncertainty prediction for the WMT metrics task datasets, with a substantial reduction in computational costs. Moreover, they demonstrate the ability of these predictors to address specific causes of uncertainty in MT evaluation, such as low-quality references and out-of-domain data.
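As an illustrative sketch of the first of these objectives (a standard formulation of heteroscedastic regression, not necessarily the exact variant used in this work), the metric is extended to predict both a quality score \(\hat{\mu}(x)\) and an input-dependent variance \(\hat{\sigma}^2(x)\), trained against the human score \(y\) with the Gaussian negative log-likelihood:
\[
\mathcal{L}(x, y) \;=\; \frac{\bigl(y - \hat{\mu}(x)\bigr)^2}{2\,\hat{\sigma}^2(x)} \;+\; \frac{1}{2}\log \hat{\sigma}^2(x),
\]
so that \(\hat{\sigma}^2(x)\) can be read off directly as an (aleatoric) uncertainty estimate, without the repeated forward passes required by Monte Carlo dropout or deep ensembles.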