As machine translation (MT) metrics improve their correlation with human judgement every year, it is crucial to understand the limitations of such metrics at the segment level. Specifically, it is important to investigate metric behaviour when facing accuracy errors in MT because these can have dangerous consequences in certain contexts (e.g., legal, medical). We curate ACES, a translation accuracy challenge set, consisting of 68 phenomena ranging from simple perturbations at the word/character level to more complex errors based on discourse and real-world knowledge. We use ACES to evaluate a wide range of MT metrics including the submissions to the WMT 2022 metrics shared task and perform several analyses leading to general recommendations for metric developers. We recommend: a) combining metrics with different strengths, b) developing metrics that give more weight to the source and less to surface-level overlap with the reference and c) explicitly modelling additional language-specific information beyond what is available via multilingual embeddings.