Mined bitexts can contain imperfect translations that yield unreliable training signals for Neural Machine Translation (NMT). While filtering out such pairs is known to improve final model quality, we argue that it is suboptimal in low-resource conditions where even mined data can be limited. In this work, we propose instead to refine the mined bitexts via automatic editing: given a sentence in a language xf and a possibly imperfect translation of it xe, our model generates a revised version xf' or xe' that yields a more equivalent translation pair (i.e., <xf, xe'> or <xf', xe>). We employ a simple editing strategy: (1) mine potentially imperfect translations for each sentence in a given bitext, and (2) train a model, in a multi-task fashion, both to reconstruct the original translations and to translate. Experiments demonstrate that our approach successfully improves the quality of CCMatrix mined bitext for 5 low-resource language pairs and 10 translation directions by up to ~8 BLEU points, in most cases improving upon a competitive back-translation baseline.
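The two-task setup above can be sketched as data construction for a single sequence-to-sequence model. This is a minimal illustrative sketch, not the paper's exact scheme: the task tags (`<edit>`, `<translate>`), the separator token, and the pairing format are all assumptions introduced here for clarity.

```python
# Hedged sketch of multi-task example construction for bitext refinement.
# Task 1 (editing): given xf and a mined, possibly imperfect translation,
#   the target is the original (reference) translation xe.
# Task 2 (translation): given xf alone, the target is xe.
# A single seq2seq model would be trained on the union of both example sets.

def make_examples(xf: str, xe: str, noisy_xe: str):
    """Return (source, target) pairs for the editing and translation tasks."""
    edit_src = f"<edit> {xf} <sep> {noisy_xe}"   # reconstruct xe from (xf, noisy xe)
    translate_src = f"<translate> {xf}"          # plain translation of xf
    return [(edit_src, xe), (translate_src, xe)]

# Toy usage with a hypothetical French-English pair:
pairs = make_examples("le chat dort", "the cat sleeps", "the cat is asleep")
```

At inference time, the same model could then be run in editing mode over a mined bitext to produce the revised pairs <xf, xe'> used for NMT training.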