Obtaining meaningful quality scores for machine translation systems through human evaluation remains a challenge, given the high variability among human evaluators, which arises partly from subjective expectations of translation quality across different language pairs. We propose XSTS, a new metric focused on semantic equivalence, along with a cross-lingual calibration method that enables more consistent assessment. We demonstrate the effectiveness of these novel contributions in large-scale evaluation studies covering up to 14 language pairs, with translation both into and out of English.