Neural machine translation models are often biased toward the limited translation references seen during training. To mitigate this form of overfitting, in this paper we propose fine-tuning the models with a novel training objective based on the recently proposed BERTScore evaluation metric. BERTScore is a scoring function based on contextual embeddings that overcomes the typical limitations of n-gram-based metrics (e.g., with synonyms and paraphrases), allowing translations that differ from the references, yet are close in the contextual embedding space, to be treated as substantially correct. To be able to use BERTScore as a training objective, we propose three approaches for generating soft predictions, allowing the network to remain fully differentiable end-to-end. Experiments carried out over four diverse language pairs have achieved improvements of up to 0.58 pp (3.28%) in BLEU score and up to 0.76 pp (0.98%) in BERTScore (F_BERT) when fine-tuning a strong baseline.
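The abstract does not spell out the three soft-prediction approaches, so the sketch below is only an illustrative example of the general idea: replacing the argmax over the decoder's output distribution with an expected token embedding keeps the candidate representation differentiable, after which a BERTScore-style greedy-matching F1 against the reference embeddings can be used as a loss. The function name `soft_bertscore_loss`, the use of a static embedding table for the soft candidate, and the omission of IDF weighting and contextual re-encoding of the candidate are all assumptions of this sketch, not details taken from the paper.

```python
import torch
import torch.nn.functional as F


def soft_bertscore_loss(logits: torch.Tensor,
                        ref_emb: torch.Tensor,
                        token_emb: torch.Tensor) -> torch.Tensor:
    """Differentiable BERTScore-style loss (illustrative sketch, not the paper's exact method).

    logits:    (B, Tc, V) decoder scores over the vocabulary for the candidate
    ref_emb:   (B, Tr, D) contextual embeddings of the reference tokens
    token_emb: (V, D)     embedding table used to build "soft" candidate embeddings
    """
    # Soft predictions: expected embedding under the model's output distribution,
    # instead of a hard (non-differentiable) argmax over the vocabulary.
    probs = logits.softmax(dim=-1)                 # (B, Tc, V)
    cand_emb = probs @ token_emb                   # (B, Tc, D)

    # Cosine similarity matrix between candidate and reference tokens.
    cand = F.normalize(cand_emb, dim=-1)
    ref = F.normalize(ref_emb, dim=-1)
    sim = cand @ ref.transpose(1, 2)               # (B, Tc, Tr)

    # Greedy matching as in BERTScore: precision matches each candidate token
    # to its closest reference token, recall does the reverse.
    precision = sim.max(dim=2).values.mean(dim=1)  # (B,)
    recall = sim.max(dim=1).values.mean(dim=1)     # (B,)
    f_bert = 2 * precision * recall / (precision + recall + 1e-8)

    # Minimising 1 - F_BERT rewards candidates close to the reference in
    # embedding space even when their surface n-grams differ.
    return (1.0 - f_bert).mean()
```

In a fine-tuning setup of this kind, the loss above would typically be combined with (or warm-started from) the standard cross-entropy objective, since training on the soft metric alone can drift away from fluent outputs.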