Self-training has proven effective for improving NMT performance by augmenting model training with synthetic parallel data. The common practice is to construct synthetic data based on a randomly sampled subset of large-scale monolingual data, which we empirically show is sub-optimal. In this work, we propose to improve the sampling procedure by selecting the most informative monolingual sentences to complement the parallel data. To this end, we compute the uncertainty of monolingual sentences using the bilingual dictionary extracted from the parallel data. Intuitively, monolingual sentences with lower uncertainty generally correspond to easy-to-translate patterns that may not provide additional gains. Accordingly, we design an uncertainty-based sampling strategy to efficiently exploit the monolingual data for self-training, in which monolingual sentences with higher uncertainty are sampled with higher probability. Experimental results on large-scale WMT English$\Rightarrow$German and English$\Rightarrow$Chinese datasets demonstrate the effectiveness of the proposed approach. Extensive analyses suggest that our emphasis on learning from uncertain monolingual sentences improves the translation quality of high-uncertainty sentences and also benefits the prediction of low-frequency words on the target side.
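To make the described procedure concrete, the sketch below shows one plausible Python instantiation of the uncertainty computation and the sampling step: a word's uncertainty is taken as the entropy of its translation distribution in the extracted bilingual dictionary, a sentence's uncertainty as the average over its in-dictionary words, and sentences are drawn with weights proportional to that score. The function names, the skipping of out-of-dictionary words, and the proportional weighting are illustrative assumptions rather than the paper's exact formulation.

```python
import math
import random

def word_entropy(translation_probs):
    """Entropy of one word's translation distribution,
    e.g. {"Haus": 0.7, "Heim": 0.3} from the extracted dictionary."""
    return -sum(p * math.log(p) for p in translation_probs.values() if p > 0)

def sentence_uncertainty(sentence, dictionary):
    """Average translation entropy over the words of a monolingual
    sentence; words absent from the dictionary are skipped here
    (one possible convention, not necessarily the paper's)."""
    entropies = [word_entropy(dictionary[w]) for w in sentence.split() if w in dictionary]
    return sum(entropies) / len(entropies) if entropies else 0.0

def uncertainty_based_sample(sentences, dictionary, k):
    """Sample k monolingual sentences with probability proportional to
    their uncertainty, so high-uncertainty (hard-to-translate) sentences
    are more likely to be selected for self-training. Note that
    random.choices samples with replacement; a real pipeline would
    likely sample without replacement."""
    weights = [sentence_uncertainty(s, dictionary) for s in sentences]
    if not any(weights):  # degenerate case: fall back to uniform sampling
        weights = [1.0] * len(sentences)
    return random.choices(sentences, weights=weights, k=k)

# Toy usage: a two-entry dictionary and three monolingual sentences.
dictionary = {
    "house": {"Haus": 0.7, "Heim": 0.3},  # ambiguous -> higher entropy
    "the":   {"das": 1.0},                # unambiguous -> zero entropy
}
mono = ["the house", "the the the", "house house"]
print(uncertainty_based_sample(mono, dictionary, k=2))
```

Under this weighting, "house house" is the most likely pick and "the the the" is never selected, which mirrors the intuition that low-uncertainty, easy-to-translate sentences contribute little additional signal.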