The vast majority of evaluation metrics for machine translation are supervised: they (i) are trained on human scores, (ii) assume the existence of reference translations, or (iii) leverage parallel data. This hinders their applicability in cases where such supervision signals are unavailable. In this work, we develop fully unsupervised evaluation metrics. To do so, we exploit similarities and synergies between evaluation metric induction, parallel corpus mining, and MT systems. In particular, we use an unsupervised evaluation metric to mine pseudo-parallel data, which we then use to remap deficient underlying vector spaces (in an iterative manner) and to induce an unsupervised MT system, which in turn provides pseudo-references as an additional component of the metric. Finally, we also induce unsupervised multilingual sentence embeddings from the pseudo-parallel data. We show that our fully unsupervised metrics are effective: they beat supervised competitors on 4 out of our 5 evaluation datasets. We make our code publicly available.
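The iterative bootstrapping described above can be sketched in heavily simplified form. This is a toy illustration under our own assumptions, not the paper's implementation: the unsupervised metric is plain cosine similarity over sentence embeddings, the remapping step is an orthogonal Procrustes fit on the mined pairs, the MT/pseudo-reference component is omitted, and all function names are hypothetical.

```python
import numpy as np

def metric_scores(src_emb, tgt_emb):
    """Toy unsupervised metric: cosine similarity between sentence embeddings."""
    s = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    t = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    return s @ t.T

def mine_pseudo_parallel(src_emb, tgt_emb, thresh):
    """Keep (i, j) pairs whose metric score clears a confidence threshold."""
    sims = metric_scores(src_emb, tgt_emb)
    best = sims.argmax(axis=1)
    return [(i, j) for i, j in enumerate(best) if sims[i, j] >= thresh]

def remap(src_emb, tgt_emb, pairs):
    """Re-align the deficient source space onto the target space with an
    orthogonal Procrustes fit on the mined pseudo-parallel pairs."""
    X = src_emb[[i for i, _ in pairs]]
    Y = tgt_emb[[j for _, j in pairs]]
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return src_emb @ (U @ Vt)

def iterative_bootstrap(src_emb, tgt_emb, iters=3, thresh=0.5):
    """Alternate mining and remapping, mimicking the iterative loop."""
    for _ in range(iters):
        pairs = mine_pseudo_parallel(src_emb, tgt_emb, thresh)
        if not pairs:
            break
        src_emb = remap(src_emb, tgt_emb, pairs)
    return src_emb
```

In the full pipeline, the mined pseudo-parallel pairs would additionally serve to train an unsupervised MT system, whose outputs act as pseudo-references inside the metric; that component is beyond this sketch.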