Neural machine translation (NMT) models are conventionally trained with token-level negative log-likelihood (NLL), which does not guarantee that the generated translations are optimized for a chosen sequence-level evaluation metric. Multiple approaches have been proposed to train NMT with BLEU as the reward in order to improve the metric directly. However, it has been reported that gains in BLEU do not translate into real quality improvements, limiting their application in industry. Recently, it has become clear to the community that BLEU correlates poorly with human judgment when evaluating state-of-the-art models. This has led to the emergence of model-based evaluation metrics, which have been shown to correlate much more strongly with human judgment. In this paper, we investigate whether it is beneficial to optimize NMT models with a state-of-the-art model-based metric, BLEURT. We propose a contrastive-margin loss for fast and stable reward optimization suitable for large NMT models. In experiments, we perform automatic and human evaluations to compare models trained with smoothed BLEU and BLEURT against baseline models. Results show that reward optimization with BLEURT increases the metric scores by a large margin, in contrast to the limited gain obtained when training with smoothed BLEU. The human evaluation shows that models trained with BLEURT improve the adequacy and coverage of translations. Code is available at https://github.com/naver-ai/MetricMT.
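The abstract names a contrastive-margin loss but does not spell out its form; the paper's exact formulation may differ. As a minimal hedged sketch, one common shape for such an objective is a hinge loss over a pair of sampled hypotheses ranked by their reward (here BLEURT): the hypothesis with the higher reward should receive a higher model score, by at least a fixed margin. The function name, arguments, and default margin below are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch (NOT the paper's implementation): a hinge-style
# contrastive-margin loss over the model scores of two hypotheses
# for the same source sentence, ranked by an external reward such as
# BLEURT. "score" here stands in for, e.g., a length-normalized
# log-probability under the NMT model.

def contrastive_margin_loss(score_better, score_worse, margin=1.0):
    """Penalize the model when the higher-reward hypothesis is not
    scored above the lower-reward one by at least `margin`."""
    return max(0.0, margin - (score_better - score_worse))

# Model already ranks the higher-reward hypothesis well above the
# lower-reward one: the margin is satisfied, so the loss is zero.
print(contrastive_margin_loss(-0.5, -2.0))               # 0.0

# Model ranks the hypotheses the wrong way round: the loss grows
# with the size of the margin violation.
print(contrastive_margin_loss(-2.0, -0.5, margin=1.0))   # 2.5
```

Compared with policy-gradient-style reward optimization, a pairwise margin objective of this kind avoids high-variance sampling of the full reward signal, which is one plausible reading of the "fast and stable" claim; the actual design choices are detailed in the paper and repository.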