Automatic machine translation (MT) metrics are widely used to distinguish the translation quality of machine translation systems across relatively large test sets (system-level evaluation). However, it is unclear whether automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level (segment-level evaluation). In this paper, we investigate how useful MT metrics are at detecting the success of a machine translation component when placed in a larger platform with a downstream task. We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks (dialogue state tracking, question answering, and semantic parsing). For each task, we only have access to a monolingual task-specific model. We calculate the correlation between the metric's ability to predict a good/bad translation and the success/failure on the final task in the Translate-Test setup. Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes. We also find that the scores produced by neural metrics are not interpretable, largely because of their undefined ranges. Our analysis suggests that future MT metrics should be designed to produce error labels rather than scores, to facilitate extrinsic evaluation.
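As a minimal sketch of the segment-level analysis described above (not the paper's actual code), the snippet below correlates per-segment metric scores with binary downstream outcomes in a Translate-Test setup. The data arrays and the choice of Kendall's tau as the correlation statistic are illustrative assumptions.

```python
# Sketch: correlate segment-level MT metric scores with downstream task success.
import numpy as np
from scipy.stats import kendalltau

# Hypothetical segment-level metric scores (e.g., chrF or COMET), one per input.
metric_scores = np.array([0.71, 0.45, 0.88, 0.30, 0.64])

# Hypothetical binary downstream outcomes for the same inputs:
# 1 = the monolingual task-specific model succeeded on the translated input,
# 0 = it failed.
task_success = np.array([1, 0, 1, 0, 1])

# Rank correlation between the metric's segment scores and task success/failure.
tau, p_value = kendalltau(metric_scores, task_success)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")
```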