State-of-the-art methods for learning cross-lingual word embeddings have relied on bilingual dictionaries or parallel corpora. Recent studies have shown that the need for parallel data supervision can be alleviated with character-level information. While these methods show encouraging results, they are not on par with their supervised counterparts and are limited to pairs of languages sharing a common alphabet. In this work, we show that we can build a bilingual dictionary between two languages without using any parallel corpora, by aligning monolingual word embedding spaces in an unsupervised way. Without using any character information, our model even outperforms existing supervised methods on cross-lingual tasks for some language pairs. Our experiments demonstrate that our method also works well for distant language pairs, such as English-Russian or English-Chinese. We finally describe experiments on the English-Esperanto low-resource language pair, for which only a limited amount of parallel data exists, to show the potential impact of our method in fully unsupervised machine translation. Our code, embeddings, and dictionaries are publicly available.
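To make the alignment idea concrete, the sketch below shows the orthogonal Procrustes step used to refine a linear mapping between two monolingual embedding spaces once a (possibly synthetic) dictionary of aligned word pairs is available; in the fully unsupervised setting described above, the initial mapping is learned adversarially and translation pairs are retrieved with the CSLS criterion rather than plain nearest neighbors. This is a minimal illustration under those assumptions, not the released implementation; the names `procrustes_align` and `translate` are ours.

```python
import numpy as np

def procrustes_align(X, Y):
    """Closed-form orthogonal W minimizing ||X W^T - Y||_F.

    X: (n, d) source-language vectors, Y: (n, d) target-language
    vectors, where rows i of X and Y are assumed translation pairs.
    """
    # Orthogonal Procrustes: W = U V^T with U S V^T = SVD(Y^T X).
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt  # (d, d) orthogonal mapping

def translate(src_vec, W, tgt_vecs, tgt_words):
    """Map a source vector into the target space and return the
    nearest target word by cosine similarity (a stand-in for the
    paper's CSLS retrieval)."""
    mapped = W @ src_vec
    mapped /= np.linalg.norm(mapped)
    tgt_norm = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    return tgt_words[int(np.argmax(tgt_norm @ mapped))]
```

Constraining W to be orthogonal matters: it preserves distances and angles within the source space, so the monolingual structure of the embeddings is left intact while the two spaces are rotated into alignment.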