Although proper handling of discourse phenomena contributes significantly to the quality of machine translation (MT), common translation quality metrics do not adequately capture them. Recent work on context-aware MT attempts to target a small set of these phenomena during evaluation. In this paper, we propose a new metric, P-CXMI, which allows us to systematically identify translations that require context; it confirms the difficulty of previously studied phenomena and uncovers new ones not addressed in prior work. We then develop the Multilingual Discourse-Aware (MuDA) benchmark, a series of taggers for these phenomena in 14 language pairs, which we use to evaluate context-aware MT. We find that state-of-the-art context-aware MT models achieve only marginal improvements over context-agnostic models on our benchmark, suggesting that current models do not handle these ambiguities effectively. We release code and data to invite the MT research community to increase efforts on context-aware translation for discourse phenomena and languages that are currently overlooked.
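To make the idea behind a pointwise, context-conditional metric concrete, here is a minimal sketch of how a per-token score of this kind can be computed: the difference between each reference token's log-probability under a context-aware model and under a context-agnostic one. The function name and the toy log-probability inputs are hypothetical illustrations, not the paper's implementation; in practice the per-token log-probabilities would come from scoring the reference with two trained MT models.

```python
def per_token_context_gain(logp_with_context, logp_without_context):
    """Per-token score: how much extra-sentential context changes the
    log-probability a model assigns to each reference token.

    Both arguments are lists of per-token log-probabilities of the same
    reference translation, scored by a context-aware and a
    context-agnostic model respectively (toy values below).
    A large positive score flags a token as context-dependent.
    """
    assert len(logp_with_context) == len(logp_without_context)
    return [lc - lnc for lc, lnc in zip(logp_with_context, logp_without_context)]

# Toy example: the third token (e.g. an ambiguous pronoun) becomes much
# more predictable once context is supplied, so its score dominates.
scores = per_token_context_gain([-0.1, -0.2, -0.3], [-0.1, -0.25, -1.5])
print(scores)
```

Ranking tokens by such a score is one way to surface, corpus-wide, which words genuinely require context to translate, rather than relying on a hand-picked list of phenomena.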