Successful methods for unsupervised neural machine translation (UNMT) employ cross-lingual pretraining via self-supervision, often in the form of a masked language modeling or sequence generation task, which requires the model to align the lexical- and high-level representations of the two languages. While cross-lingual pretraining works for similar languages with abundant corpora, it performs poorly for low-resource and distant languages. Previous research has shown that this is because the representations are not sufficiently aligned. In this paper, we enhance the bilingual masked language model pretraining with lexical-level information by using type-level cross-lingual subword embeddings. Empirical results demonstrate that our method improves performance on both UNMT (by up to 4.5 BLEU) and bilingual lexicon induction compared to an established UNMT baseline.
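To make the core idea concrete, the sketch below shows one plausible way to inject type-level cross-lingual subword embeddings into masked language model pretraining: pre-aligned subword vectors (e.g., fastText vectors mapped into a shared space with an offline method such as VecMap) are used to initialize the model's token-embedding matrix before bilingual MLM training. This is a minimal illustration under stated assumptions, not the paper's implementation; the file name `aligned_subword_vectors.vec`, the `vocab` mapping, and the `TransformerEncoderLM` class are hypothetical placeholders.

```python
# Minimal sketch (not the authors' code): initialize a bilingual masked
# language model's token-embedding matrix with pre-trained, type-level,
# cross-lingually aligned subword embeddings before MLM pretraining.
# Assumption: the vector file is in word2vec text format over the shared
# subword (BPE) vocabulary used by the MLM.

import torch
import torch.nn as nn


def load_aligned_vectors(path):
    """Read word2vec-format text vectors: one 'token v1 ... vd' line per subword type."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        header = f.readline().split()
        dim = int(header[1])
        for line in f:
            parts = line.rstrip().split(" ")
            token, values = parts[0], parts[1:]
            if len(values) == dim:
                vectors[token] = torch.tensor([float(v) for v in values])
    return vectors, dim


def init_embedding_from_vectors(vocab, vectors, dim):
    """Build an embedding layer whose rows are the aligned subword vectors.
    Subword types missing from the vector file keep their random initialization."""
    emb = nn.Embedding(len(vocab), dim)
    with torch.no_grad():
        for token, idx in vocab.items():
            if token in vectors:
                emb.weight[idx] = vectors[token]
    return emb


# Usage (hypothetical): plug the initialized embeddings into the encoder
# before running bilingual masked language model pretraining.
# vectors, dim = load_aligned_vectors("aligned_subword_vectors.vec")
# embedding = init_embedding_from_vectors(vocab, vectors, dim)
# model = TransformerEncoderLM(embedding=embedding, ...)
```

The design choice this illustrates is that the alignment signal enters at the lexical (embedding) level rather than being learned only implicitly during joint pretraining, which is where, per the abstract, low-resource and distant language pairs fall short.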