The vast majority of evaluation metrics for machine translation are supervised, i.e., they (i) assume the existence of reference translations, (ii) are trained on human scores, or (iii) leverage parallel data. This hinders their applicability to cases where such supervision signals are not available. In this work, we develop fully unsupervised evaluation metrics. To do so, we leverage similarities and synergies between evaluation metric induction, parallel corpus mining, and MT systems. In particular, we use an unsupervised evaluation metric to mine pseudo-parallel data. This data is then used, in an iterative manner, to remap deficient underlying vector spaces and to induce an unsupervised MT system, which in turn provides pseudo-references as an additional component of the metric. Finally, we also induce unsupervised multilingual sentence embeddings from pseudo-parallel data. We show that our fully unsupervised metrics are effective, i.e., they beat supervised competitors on 4 out of our 5 evaluation datasets.
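The iterative loop sketched in the abstract (score with an unsupervised metric, mine the highest-scoring pseudo-parallel pairs, remap the vector spaces, rescore) can be illustrated in a toy setting. The sketch below is an assumption-laden illustration, not the paper's implementation: it uses cosine similarity over synthetic embeddings as the unsupervised metric, and orthogonal Procrustes as one common choice of remapping; the helper names and the rotated-copy toy data are invented for the example.

```python
import numpy as np

def mine_pseudo_parallel(src_emb, tgt_emb, k):
    """Mine the k highest-scoring sentence pairs under cosine similarity
    (a stand-in for the unsupervised evaluation metric)."""
    s = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    t = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = s @ t.T
    best_tgt = sim.argmax(axis=1)              # best target for each source
    top_src = np.argsort(-sim.max(axis=1))[:k] # keep only confident pairs
    return top_src, best_tgt[top_src]

def remap(src_emb, tgt_emb, src_idx, tgt_idx):
    """Orthogonal Procrustes: rotate the source space onto the target space
    using only the mined pseudo-parallel pairs."""
    u, _, vt = np.linalg.svd(src_emb[src_idx].T @ tgt_emb[tgt_idx])
    return u @ vt

# Toy data: the target space is a slightly rotated copy of the source space,
# so the true alignment is i <-> i (purely illustrative, not real embeddings).
rng = np.random.default_rng(0)
d = 8
src = rng.normal(size=(50, d))
q, r = np.linalg.qr(np.eye(d) + 0.1 * rng.normal(size=(d, d)))
q = q * np.sign(np.diag(r))  # fix column signs so q stays near the identity
tgt = src @ q

w = np.eye(d)
for _ in range(3):  # iterate: mine with the current map, then refit the map
    src_idx, tgt_idx = mine_pseudo_parallel(src @ w, tgt, k=20)
    w = remap(src, tgt, src_idx, tgt_idx)

print(np.allclose(src @ w, tgt, atol=1e-6))  # the learned map recovers the rotation
```

In the full pipeline, the same mined pseudo-parallel data would additionally train an unsupervised MT system whose outputs serve as pseudo-references, so each iteration improves both the embedding spaces and the metric that mines the next round of data.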