Self-Training Sampling with Monolingual Data Uncertainty for Neural Machine Translation

Self-training has proven effective for improving NMT performance by augmenting model training with synthetic parallel data. The common practice is to construct synthetic data based on a randomly sampled subset of large-scale monolingual data, which we empirically show is sub-optimal. In this work, we propose to improve the sampling procedure by selecting the most informative monolingual sentences to complement the parallel data. To this end, we compute the uncertainty of monolingual sentences using the bilingual dictionary extracted from the parallel data. Intuitively, monolingual sentences with lower uncertainty generally correspond to easy-to-translate patterns which may not provide additional gains. Accordingly, we design an uncertainty-based sampling strategy to efficiently exploit the monolingual data for self-training, in which monolingual sentences with higher uncertainty would be sampled with higher probability. Experimental results on large-scale WMT English$\Rightarrow$German and English$\Rightarrow$Chinese datasets demonstrate the effectiveness of the proposed approach. Extensive analyses suggest that emphasizing the learning on uncertain monolingual sentences by our approach does improve the translation quality of high-uncertainty sentences and also benefits the prediction of low-frequency words at the target side.
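The sampling idea in the abstract can be illustrated with a small sketch. The abstract only says that uncertainty is computed from a bilingual dictionary extracted from the parallel data and that higher-uncertainty sentences are sampled with higher probability; everything else here is an assumption made for illustration. In particular, the sketch assumes the dictionary maps each source word to a probability distribution over translations, that word-level uncertainty is the entropy of that distribution, and that sentence-level uncertainty is the mean over covered words. The function names, the `eps` smoothing term, and the toy dictionary are hypothetical.

```python
import math
import random


def word_entropy(translation_probs):
    """Shannon entropy (in nats) of one source word's translation distribution."""
    return -sum(p * math.log(p) for p in translation_probs.values() if p > 0)


def sentence_uncertainty(tokens, dictionary):
    """Mean word entropy over tokens covered by the dictionary (assumed definition).

    Returns 0.0 when no token is covered.
    """
    entropies = [word_entropy(dictionary[t]) for t in tokens if t in dictionary]
    return sum(entropies) / len(entropies) if entropies else 0.0


def sample_by_uncertainty(sentences, dictionary, k, eps=1e-6, seed=0):
    """Pick k distinct sentences, favoring those with higher uncertainty.

    Uses weighted sampling without replacement (Efraimidis-Spirakis keys):
    each sentence gets the key log(u) / w with u uniform in (0, 1], and the
    k largest keys are kept. eps keeps every weight strictly positive.
    """
    rng = random.Random(seed)
    keyed = []
    for index, tokens in enumerate(sentences):
        weight = sentence_uncertainty(tokens, dictionary) + eps
        u = 1.0 - rng.random()  # in (0, 1], so log(u) is always defined
        keyed.append((math.log(u) / weight, index))
    keyed.sort(reverse=True)
    return [sentences[index] for _, index in keyed[:k]]


# Toy dictionary (hypothetical values): "bank" is ambiguous, "the" is not.
toy_dictionary = {
    "the": {"die": 1.0},
    "bank": {"Bank": 0.5, "Ufer": 0.5},
}
toy_monolingual = [["the"], ["the", "bank"], ["bank"]]

if __name__ == "__main__":
    for sentence in toy_monolingual:
        print(sentence, round(sentence_uncertainty(sentence, toy_dictionary), 4))
    print(sample_by_uncertainty(toy_monolingual, toy_dictionary, k=2))
```

Sampling without replacement is a design choice of this sketch, made so that the selected subset contains no repeated sentences. The selected sentences would then be translated by the model to form the synthetic parallel data used for self-training.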


