In Neural Machine Translation (NMT), data augmentation methods such as back-translation have proven their effectiveness in improving translation performance. In this paper, we propose a novel data augmentation approach for NMT, which is independent of any additional training data. Our approach, AdMix, consists of two parts: 1) introduce faint discrete noise (word replacement, word dropping, word swapping) into the original sentence pairs to form augmented samples; 2) generate new synthetic training data by softly mixing the augmented samples with their original samples in the training corpus. Experiments on three translation datasets of different scales show that AdMix achieves significant improvements (1.0 to 2.7 BLEU points) over a strong Transformer baseline. When combined with other data augmentation techniques (e.g., back-translation), our approach can obtain further improvements.
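To make the two-part procedure concrete, here is a minimal sketch of how such an augmentation could be implemented. The abstract does not specify the exact mixing mechanism, so this sketch assumes a mixup-style interpolation of token embeddings with a Beta-distributed mixing weight; the function names (`add_discrete_noise`, `soft_mix`) and the parameters `p` and `alpha` are hypothetical illustrations, not the paper's actual implementation.

```python
import random
import torch

def add_discrete_noise(tokens, vocab, p=0.1):
    """Apply faint word-level noise: replacement, dropping, and swapping.

    `p` is a hypothetical per-operation noise probability.
    """
    out = list(tokens)
    # Word replacement: substitute a token with a random vocabulary word.
    out = [random.choice(vocab) if random.random() < p else t for t in out]
    # Word dropping: remove tokens with probability p (keep original if all dropped).
    out = [t for t in out if random.random() >= p] or list(tokens)
    # Word swapping: exchange one pair of adjacent tokens.
    if len(out) > 1 and random.random() < p:
        i = random.randrange(len(out) - 1)
        out[i], out[i + 1] = out[i + 1], out[i]
    return out

def soft_mix(orig_emb, aug_emb, alpha=0.2):
    """Softly mix an augmented sample with its original sample in embedding space.

    Assumes `orig_emb` and `aug_emb` are (seq_len, dim) tensors padded to the
    same length; the mixing weight is drawn from Beta(alpha, alpha), as in mixup.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    return lam * orig_emb + (1.0 - lam) * aug_emb

# Usage sketch: noise the source tokens, embed both versions, then mix.
# vocab = ["the", "cat", "sat", ...]; embed = nn.Embedding(len(vocab), 512)
# aug_tokens = add_discrete_noise(src_tokens, vocab)
# mixed = soft_mix(embed(src_ids), embed(aug_ids))
```

In this reading, the mixed embeddings would replace the original encoder inputs for the synthetic training examples, while the target side is kept unchanged; whether AdMix mixes on the source side, the target side, or both is not stated in the abstract.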