Language coverage bias, which indicates the content-dependent differences between sentence pairs originating from the source and target languages, is important for neural machine translation (NMT) because the target-original training data is not well exploited in current practice. By carefully designing experiments, we provide comprehensive analyses of the language coverage bias in the training data, and find that using only the source-original data achieves performance comparable to using the full training data. Based on these observations, we further propose two simple and effective approaches to alleviate the language coverage bias problem by explicitly distinguishing between the source- and target-original training data, which consistently improve performance over strong baselines on six WMT20 translation tasks. Complementary to the translationese effect, language coverage bias provides another explanation for the performance drop caused by back-translation. We also apply our approach to both back- and forward-translation and find that mitigating the language coverage bias can improve the performance of both of these representative data augmentation methods and their tagged variants.
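As a minimal sketch of the tagging idea described above (in the spirit of tagged back-translation), one could prepend an origin token to the source side of each training pair so the model can condition on data provenance. The tag strings and helper name below are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch: mark each training pair with its origin so that
# source-original and target-original data are explicitly distinguished.
# Tag tokens and the helper name are hypothetical, for illustration only.

SRC_ORIG_TAG = "<src_orig>"  # hypothetical tag for source-original pairs
TGT_ORIG_TAG = "<tgt_orig>"  # hypothetical tag for target-original pairs

def tag_pair(src_sentence: str, is_source_original: bool) -> str:
    """Prepend an origin tag to the source sentence of a parallel pair."""
    tag = SRC_ORIG_TAG if is_source_original else TGT_ORIG_TAG
    return f"{tag} {src_sentence}"

if __name__ == "__main__":
    # Source-original pair: the source side was originally written
    # in the source language.
    print(tag_pair("Der Zug kommt pünktlich an.", is_source_original=True))
    # Target-original pair: the source side is itself a translation
    # (e.g., produced by back-translation from the target language).
    print(tag_pair("Der Zug wird rechtzeitig ankommen.", is_source_original=False))
```

The tagged source sentences can then be fed to any standard NMT training pipeline unchanged, since the tags behave as ordinary vocabulary items.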