This paper explores augmenting monolingual data for knowledge distillation in neural machine translation. Source-language monolingual text can be incorporated through forward translation. Interestingly, we find the best way to incorporate target-language monolingual text is to translate it to the source language and round-trip translate it back to the target language, resulting in a fully synthetic corpus. We find that combining monolingual data from both source and target languages yields better performance than a corpus twice as large in only one language. Moreover, experiments reveal that the improvement depends upon the provenance of the test set. If the test set was originally in the source language (with the target side written by translators), then forward-translating source monolingual data matters. If the test set was originally in the target language (with the source side written by translators), then incorporating target monolingual data matters.
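To make the two augmentation routes concrete, below is a minimal sketch in Python. The `teacher_src2tgt` and `teacher_tgt2src` helpers are hypothetical placeholders standing in for the teacher translation systems (assumptions for illustration, not the paper's actual implementation); the sketch only shows how the two monolingual streams become synthetic parallel pairs for training the student.

```python
# Hypothetical sketch of the monolingual augmentation pipeline for
# knowledge distillation. The teacher functions below are placeholders
# (assumptions), standing in for real source->target and target->source
# teacher models.

def teacher_src2tgt(sentences):
    """Placeholder: translate source-language sentences to the target language."""
    return ["<target translation of: %s>" % s for s in sentences]

def teacher_tgt2src(sentences):
    """Placeholder: translate target-language sentences to the source language."""
    return ["<source translation of: %s>" % s for s in sentences]

def distillation_corpus(mono_src, mono_tgt):
    """Build (source, target) pairs for student training from monolingual text."""
    pairs = []
    # Source monolingual text: forward-translate with the teacher, giving
    # (real source, teacher-generated target) pairs.
    pairs += list(zip(mono_src, teacher_src2tgt(mono_src)))
    # Target monolingual text: back-translate to the source language, then
    # round-trip translate back to the target language, giving a fully
    # synthetic pair whose target side reflects the teacher's distribution.
    synthetic_src = teacher_tgt2src(mono_tgt)
    pairs += list(zip(synthetic_src, teacher_src2tgt(synthetic_src)))
    return pairs

corpus = distillation_corpus(
    mono_src=["a source-language sentence"],
    mono_tgt=["a target-language sentence"],
)
```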