The high-quality translations produced by machine translation (MT) systems still pose a considerable challenge for automatic evaluation. Current MT evaluation pays the same attention to every sentence component, whereas the questions in real-world examinations (e.g., university examinations) carry different difficulties and weightings. In this paper, we propose a novel difficulty-aware MT evaluation metric that expands the evaluation dimension by taking translation difficulty into consideration. A translation that most MT systems fail to predict is treated as a difficult one and assigned a large weight in the final score function, and vice versa. Experimental results on the WMT19 English-German Metrics shared task show that our proposed method outperforms commonly used MT metrics in terms of correlation with human judgments. In particular, our method performs well even when all the MT systems are highly competitive, which is precisely when most existing metrics fail to distinguish between them. The source code is freely available at https://github.com/NLP2CT/Difficulty-Aware-MT-Evaluation.
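To make the weighting idea concrete, the following is a minimal sketch of difficulty-aware scoring, not the paper's actual formulation: token-level difficulty is estimated from how often a pool of MT system outputs misses each reference token, and a simple unigram recall is re-weighted by that difficulty. All function and variable names here are illustrative assumptions.

```python
from collections import Counter

def token_difficulty(reference, system_outputs):
    """Weight each reference token by the fraction of systems that miss it.

    A token reproduced by every system gets weight 0 (easy); a token missed
    by every system gets weight 1 (difficult). This unigram-match notion of
    difficulty is a simplification assumed for illustration.
    """
    ref_tokens = reference.split()
    weights = []
    for tok in ref_tokens:
        misses = sum(1 for hyp in system_outputs if tok not in hyp.split())
        weights.append(misses / len(system_outputs))
    return ref_tokens, weights

def difficulty_aware_score(candidate, reference, system_outputs):
    """Difficulty-weighted unigram recall of the candidate vs. the reference."""
    ref_tokens, weights = token_difficulty(reference, system_outputs)
    cand_counts = Counter(candidate.split())
    matched = total = 0.0
    for tok, weight in zip(ref_tokens, weights):
        total += weight
        if cand_counts[tok] > 0:
            matched += weight
            cand_counts[tok] -= 1  # each candidate token matches at most once
    # Guard against the degenerate case where every token is "easy" (weight 0).
    return matched / total if total > 0 else 0.0

# Usage: "rug" is missed by both systems, so it dominates the score, while
# tokens that every system gets right contribute nothing.
systems = ["the cat sat on the mat", "a cat sat on the mat"]
reference = "the cat sat on the rug"
print(difficulty_aware_score("the cat sat on the rug", reference, systems))  # 1.0
print(difficulty_aware_score("the cat sat on the mat", reference, systems))  # 0.0
```

Under this weighting, a metric rewards a candidate mainly for getting right what competing systems get wrong, which is what lets it separate systems that are otherwise very close.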