This paper explores augmenting monolingual data for knowledge distillation in neural machine translation. Source-language monolingual text can be incorporated through forward translation. Interestingly, we find the best way to incorporate target-language monolingual text is to translate it to the source language and round-trip translate it back to the target language, resulting in a fully synthetic corpus. We find that combining monolingual data from both source and target languages yields better performance than a corpus twice as large in only one language. Moreover, experiments reveal that the improvement depends upon the provenance of the test set. If the test set was originally in the source language (with the target side written by translators), then forward-translating source monolingual data matters. If the test set was originally in the target language (with the source side written by translators), then incorporating target monolingual data matters.
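To make the two augmentation routes concrete, below is a minimal sketch in Python. The `teacher_src2tgt` and `teacher_tgt2src` helpers are hypothetical placeholders standing in for the teacher translation systems (assumptions for illustration, not the paper's actual implementation); the sketch only shows how the two monolingual streams become synthetic parallel pairs for training the student.

```python
# Hypothetical sketch of the monolingual augmentation pipeline for
# knowledge distillation. The teacher functions below are placeholders
# (assumptions), standing in for real source->target and target->source
# teacher models.

def teacher_src2tgt(sentences):
    """Placeholder: translate source-language sentences to the target language."""
    return ["<target translation of: %s>" % s for s in sentences]

def teacher_tgt2src(sentences):
    """Placeholder: translate target-language sentences to the source language."""
    return ["<source translation of: %s>" % s for s in sentences]

def distillation_corpus(mono_src, mono_tgt):
    """Build (source, target) pairs for student training from monolingual text."""
    pairs = []
    # Source monolingual text: forward-translate with the teacher, giving
    # (real source, teacher-generated target) pairs.
    pairs += list(zip(mono_src, teacher_src2tgt(mono_src)))
    # Target monolingual text: back-translate to the source language, then
    # round-trip translate back to the target language, giving a fully
    # synthetic pair whose target side reflects the teacher's distribution.
    synthetic_src = teacher_tgt2src(mono_tgt)
    pairs += list(zip(synthetic_src, teacher_src2tgt(synthetic_src)))
    return pairs

corpus = distillation_corpus(
    mono_src=["a source-language sentence"],
    mono_tgt=["a target-language sentence"],
)
```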