Automatic machine translation (MT) metrics are widely used to distinguish the translation quality of machine translation systems across relatively large test sets (system-level evaluation). However, it is unclear whether automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level (segment-level evaluation). In this paper, we investigate how useful MT metrics are at detecting the success of a machine translation component when placed in a larger platform with a downstream task. We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks (dialogue state tracking, question answering, and semantic parsing). For each task, we only have access to a monolingual task-specific model. We calculate the correlation between the metric's ability to predict a good/bad translation and the success/failure on the final task in the Translate-Test setup. Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes. We also find that the scores produced by neural metrics are not interpretable, largely because of their undefined ranges. Our analysis suggests that future MT metrics should be designed to produce error labels rather than scores, to facilitate extrinsic evaluation.
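As a minimal sketch of the segment-level analysis described above (not the paper's actual code), the snippet below correlates per-segment metric scores with binary downstream outcomes in a Translate-Test setup. The data arrays and the choice of Kendall's tau as the correlation statistic are illustrative assumptions.

```python
# Sketch: correlate segment-level MT metric scores with downstream task success.
import numpy as np
from scipy.stats import kendalltau

# Hypothetical segment-level metric scores (e.g., chrF or COMET), one per input.
metric_scores = np.array([0.71, 0.45, 0.88, 0.30, 0.64])

# Hypothetical binary downstream outcomes for the same inputs:
# 1 = the monolingual task-specific model succeeded on the translated input,
# 0 = it failed.
task_success = np.array([1, 0, 1, 0, 1])

# Rank correlation between the metric's segment scores and task success/failure.
tau, p_value = kendalltau(metric_scores, task_success)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")
```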