We describe an unsupervised method to create pseudo-parallel corpora for machine translation (MT) from unaligned text. We use multilingual BERT to create source and target sentence embeddings for nearest-neighbor search and adapt the model via self-training. We validate our technique by extracting parallel sentence pairs on the BUCC 2017 bitext mining task and observe up to a 24.5 point increase (absolute) in F1 scores over previous unsupervised methods. We then improve an XLM-based unsupervised neural MT system pre-trained on Wikipedia by supplementing it with pseudo-parallel text mined from the same corpus, boosting unsupervised translation performance by up to 3.5 BLEU on the WMT'14 French-English and WMT'16 German-English tasks and outperforming the previous state-of-the-art. Finally, we enrich the IWSLT'15 English-Vietnamese corpus with pseudo-parallel Wikipedia sentence pairs, yielding a 1.2 BLEU improvement on the low-resource MT task. We demonstrate that unsupervised bitext mining is an effective way of augmenting MT datasets and complements existing techniques like initializing with pre-trained contextual embeddings.
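As a concrete illustration of the mining step described above, the sketch below embeds sentences from two unaligned corpora with multilingual BERT and pairs each source sentence with its nearest target-language neighbor by cosine similarity. The model checkpoint, mean pooling, and the similarity threshold are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of unsupervised bitext mining: embed unaligned source and
# target sentences with multilingual BERT, then keep nearest-neighbor pairs
# whose cosine similarity clears a threshold. Pooling strategy and threshold
# value are assumptions for illustration.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

def embed(sentences, batch_size=32):
    """Mean-pool the final hidden states into one unit vector per sentence."""
    vectors = []
    with torch.no_grad():
        for i in range(0, len(sentences), batch_size):
            batch = tokenizer(sentences[i:i + batch_size], padding=True,
                              truncation=True, return_tensors="pt")
            hidden = model(**batch).last_hidden_state       # (B, T, H)
            mask = batch["attention_mask"].unsqueeze(-1)    # (B, T, 1)
            pooled = (hidden * mask).sum(1) / mask.sum(1)   # masked mean pool
            vectors.append(torch.nn.functional.normalize(pooled, dim=1))
    return torch.cat(vectors)

def mine_pairs(src_sents, tgt_sents, threshold=0.9):
    """Return (src, tgt, score) triples above the similarity threshold."""
    src_vecs, tgt_vecs = embed(src_sents), embed(tgt_sents)
    sims = src_vecs @ tgt_vecs.T          # cosine similarities (unit vectors)
    scores, idx = sims.max(dim=1)         # nearest target neighbor per source
    return [(src_sents[i], tgt_sents[idx[i]], scores[i].item())
            for i in range(len(src_sents)) if scores[i].item() >= threshold]
```

At corpus scale, exhaustive pairwise search would be replaced with approximate nearest-neighbor indexing, and the abstract's self-training step would then adapt the encoder on the mined pseudo-parallel pairs before mining again.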