Pre-Training (PT) of text representations has been successfully applied to low-resource Neural Machine Translation (NMT). However, it usually fails to achieve notable gains (and sometimes even performs worse) in resource-rich NMT compared with its Random-Initialization (RI) counterpart. We take the first step towards investigating the complementarity between PT and RI in resource-rich scenarios via two probing analyses, and find that: 1) PT improves NOT the accuracy, but the generalization, by achieving flatter loss landscapes than RI; 2) PT improves NOT the confidence of lexical choice, but the negative diversity, by assigning smoother lexical probability distributions than RI. Based on these insights, we propose to combine their complementary strengths with a model fusion algorithm that uses optimal transport to align neurons between the PT and RI models. Experiments on two resource-rich translation benchmarks, WMT'17 English-Chinese (20M) and WMT'19 English-German (36M), show that PT and RI can be nicely complementary to each other, achieving substantial improvements in translation accuracy, generalization, and negative diversity. Probing tools and code are released at: https://github.com/zanchangtong/PTvsRI.
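To make the fusion idea more concrete, below is a minimal sketch of aligning the neurons of one layer to another via an assignment-based transport plan and then interpolating the weights. It is an illustration of the general technique only, not the paper's exact algorithm: the function name `fuse_layers`, the squared-Euclidean cost, the hard one-to-one coupling (a special case of optimal transport with uniform marginals), and the interpolation weight `alpha` are all assumptions for the sake of the example.

```python
# Sketch: fuse two weight matrices by aligning RI neurons to PT neurons
# with an assignment-based optimal transport plan, then averaging.
# NOT the paper's exact method; for illustration under the assumptions above.
import numpy as np
from scipy.optimize import linear_sum_assignment


def fuse_layers(w_pt: np.ndarray, w_ri: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Align the output neurons (rows) of w_ri to those of w_pt, then interpolate.

    w_pt, w_ri: weight matrices of shape (out_dim, in_dim) from the
    pre-trained-initialized and randomly-initialized models, respectively.
    alpha: interpolation weight given to the PT model.
    """
    # Cost of matching PT neuron i to RI neuron j: squared Euclidean
    # distance between their incoming weight vectors.
    cost = ((w_pt[:, None, :] - w_ri[None, :, :]) ** 2).sum(-1)
    # Hard one-to-one coupling via the Hungarian algorithm (a degenerate
    # optimal transport plan with uniform marginals).
    _, col_ind = linear_sum_assignment(cost)
    w_ri_aligned = w_ri[col_ind]  # permute RI neurons into PT order
    return alpha * w_pt + (1.0 - alpha) * w_ri_aligned


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_pt = rng.normal(size=(8, 16))
    w_ri = rng.normal(size=(8, 16))
    fused = fuse_layers(w_pt, w_ri)
    print(fused.shape)  # (8, 16)
```

In practice such alignment would be applied layer by layer before averaging, so that functionally similar neurons are interpolated with each other rather than with arbitrary counterparts; a soft (e.g., Sinkhorn-based) coupling could replace the hard assignment used here.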