This paper introduces a new data augmentation method for neural machine translation that enforces stronger semantic consistency both within and across languages. Our method is based on the Conditional Masked Language Model (CMLM), which is bidirectional and can condition on both left and right context as well as the label. We demonstrate that CMLM is an effective technique for generating context-dependent word distributions. In particular, we show that CMLM can enforce semantic consistency by conditioning on both the source and target sentences during substitution. In addition, to enhance diversity, we incorporate soft word substitution for data augmentation, which replaces a word with a probability distribution over the vocabulary. Experiments on four translation datasets of different scales show that the overall approach yields more realistic augmented data and better translation quality. Our approach consistently achieves the best performance compared with strong recent methods, with improvements of up to 1.90 BLEU points over the baseline.
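To make the soft word substitution idea concrete, here is a minimal sketch (not the authors' implementation): the masked position is represented by the probability-weighted average of all word embeddings rather than a single sampled replacement. The toy vocabulary size, embedding dimension, and the example distribution are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: a 5-word vocabulary with 4-dimensional embeddings.
vocab_size, dim = 5, 4
embedding_table = rng.normal(size=(vocab_size, dim))

def soft_substitute(probs, embeddings):
    """Soft word substitution: instead of picking one replacement word,
    represent the position as a convex combination of all word embeddings,
    weighted by the distribution a (conditional) masked LM assigns to it."""
    probs = np.asarray(probs, dtype=float)
    assert np.isclose(probs.sum(), 1.0), "distribution must sum to 1"
    return probs @ embeddings  # weighted average, shape: (dim,)

# Example distribution a masked LM might predict for the masked position.
p = np.array([0.6, 0.2, 0.1, 0.05, 0.05])
soft_vec = soft_substitute(p, embedding_table)
print(soft_vec.shape)  # (4,)
```

In the paper's setting, the distribution would come from the CMLM conditioned on source, target, and label; the soft embedding then feeds into the NMT encoder in place of the original word embedding.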