Neural machine translation (NMT) has progressed rapidly over the past several years, and modern models can achieve relatively high quality using only monolingual text data, an approach dubbed Unsupervised Machine Translation (UNMT). However, these models still struggle in a variety of ways, including on aspects of translation that are among the easiest for humans, such as correctly translating common nouns. This work explores a cheap and abundant resource to combat this problem: bilingual lexica. We test the efficacy of bilingual lexica in a real-world setup, on 200-language translation models trained on web-crawled text. We present several findings: (1) using lexical data augmentation, we demonstrate sizable performance gains for unsupervised translation; (2) we compare several families of data augmentation, demonstrating that they yield similar improvements and can be combined for even greater gains; (3) we demonstrate the importance of carefully curated lexica over larger, noisier ones, especially with larger models; and (4) we compare the efficacy of multilingual lexicon data versus human-translated parallel data. Finally, we open-source GATITOS (available at https://github.com/google-research/url-nlp/tree/main/gatitos), a new multilingual lexicon for 26 low-resource languages, which had the highest performance among the lexica in our experiments.
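To make "lexical data augmentation" concrete, the sketch below shows one common member of this family: codeswitching, where source-side words in monolingual text are substituted with their dictionary translations so the model sees word-level bilingual anchors. The toy lexicon, function name, and substitution probability are illustrative assumptions, not the paper's exact recipe.

```python
import random

# Hypothetical toy English->Spanish lexicon; real lexica such as
# GATITOS cover many entries across 26 low-resource languages.
LEXICON = {"dog": "perro", "house": "casa", "water": "agua"}

def codeswitch(sentence: str, lexicon: dict, p: float = 0.5, seed: int = 0) -> str:
    """Replace words found in the lexicon with their translations,
    each with probability p, yielding a mixed-language sentence
    usable as augmented training data."""
    rng = random.Random(seed)
    out = []
    for tok in sentence.split():
        key = tok.lower()
        if key in lexicon and rng.random() < p:
            out.append(lexicon[key])
        else:
            out.append(tok)
    return " ".join(out)

print(codeswitch("the dog drank water near the house", LEXICON, p=1.0))
# With p=1.0 every lexicon word is swapped:
# "the perro drank agua near the casa"
```

In practice the augmented sentences are mixed into the monolingual training stream; the substitution probability and direction (source-to-target or target-to-source) are tunable choices.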