State-of-the-art methods for learning cross-lingual word embeddings have relied on bilingual dictionaries or parallel corpora. Recent studies showed that the need for parallel data supervision can be alleviated with character-level information. While these methods showed encouraging results, they are not on par with their supervised counterparts and are limited to pairs of languages sharing a common alphabet. In this work, we show that we can build a bilingual dictionary between two languages without using any parallel corpora, by aligning monolingual word embedding spaces in an unsupervised way. Without using any character information, our model even outperforms existing supervised methods on cross-lingual tasks for some language pairs. Our experiments demonstrate that our method works very well also for distant language pairs, like English-Russian or English-Chinese. We finally describe experiments on the English-Esperanto low-resource language pair, on which there only exists a limited amount of parallel data, to show the potential impact of our method in fully unsupervised machine translation. Our code, embeddings and dictionaries are publicly available.
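For intuition, here is a minimal sketch (not the authors' released code) of the refinement step used in this line of work: given a seed dictionary of word pairs, the best orthogonal mapping between the two embedding spaces has a closed-form Procrustes solution, and translations are then retrieved by nearest-neighbor search in the mapped space. Variable names and the similarity measure (plain cosine rather than CSLS) are illustrative simplifications.

```python
import numpy as np

def procrustes_mapping(X_src, Y_tgt):
    """Closed-form orthogonal map W minimizing ||W x - y|| over a seed dictionary.

    X_src, Y_tgt: (n_pairs, dim) arrays; row i of X_src translates to row i of Y_tgt.
    """
    # SVD of the cross-covariance gives the optimal rotation (orthogonal Procrustes).
    U, _, Vt = np.linalg.svd(Y_tgt.T @ X_src)
    return U @ Vt  # (dim, dim) orthogonal matrix

def translate(word_vec, W, tgt_matrix, tgt_words):
    """Map a source vector into the target space and return the nearest target word."""
    mapped = W @ word_vec
    # Cosine similarity against every target embedding.
    sims = (tgt_matrix @ mapped) / (
        np.linalg.norm(tgt_matrix, axis=1) * np.linalg.norm(mapped) + 1e-8)
    return tgt_words[int(np.argmax(sims))]
```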


Related content

Distributed representations encode language as dense, low-dimensional, continuous vectors. Researchers first observed that learned word embeddings exhibit analogical relations, e.g. apple − apples ≈ car − cars and man − woman ≈ king − queen. These methods can be trained directly on large unlabeled corpora. The quality of the resulting embeddings also depends heavily on the choice of context-window size: a larger context window tends to yield embeddings that reflect topical information, while a smaller window yields embeddings that better capture a word's function and its local syntactic and semantic behavior.
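A toy sketch of the analogy arithmetic mentioned above, assuming a dictionary of pre-trained vectors (the helper name and vocabulary are illustrative):

```python
import numpy as np

def analogy(a, b, c, embeddings):
    """Return the word d such that a : b ~ c : d, e.g. man : woman ~ king : queen.

    embeddings: dict mapping word -> 1-D numpy vector (e.g. word2vec or fastText).
    """
    target = embeddings[b] - embeddings[a] + embeddings[c]
    best, best_sim = None, -1.0
    for word, vec in embeddings.items():
        if word in (a, b, c):  # exclude the query words themselves
            continue
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target) + 1e-8)
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# Expected with good embeddings: analogy("man", "woman", "king", emb) == "queen"
```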

Although the Transformer translation model (Vaswani et al., 2017) has achieved state-of-the-art performance in a variety of translation tasks, how to use document-level context to deal with discourse phenomena problematic for Transformer still remains a challenge. In this work, we extend the Transformer model with a new context encoder to represent document-level context, which is then incorporated into the original encoder and decoder. As large-scale document-level parallel corpora are usually not available, we introduce a two-step training method to take full advantage of abundant sentence-level parallel corpora and limited document-level parallel corpora. Experiments on the NIST Chinese-English datasets and the IWSLT French-English datasets show that our approach improves over Transformer significantly.
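A minimal sketch of the two-step schedule described above, using a hypothetical model interface (`sentence_params()` and `context_params()` are assumed helpers returning the two disjoint parameter groups; this is not the authors' implementation): first train the sentence-level Transformer on abundant sentence pairs, then freeze it and train only the new context-encoder parameters on the smaller document-level corpus.

```python
import torch

def two_step_training(model, sent_batches, doc_batches, loss_fn):
    """Illustrative two-step training for a context-aware Transformer."""
    # Step 1: train the sentence-level Transformer on sentence-level parallel data.
    opt = torch.optim.Adam(model.sentence_params(), lr=1e-4)
    for src, tgt in sent_batches:
        opt.zero_grad()
        loss_fn(model(src, tgt, context=None), tgt).backward()
        opt.step()

    # Step 2: freeze sentence-level weights; train only the context encoder
    # on the limited document-level parallel data.
    for p in model.sentence_params():
        p.requires_grad = False
    opt = torch.optim.Adam(model.context_params(), lr=1e-4)
    for src, tgt, ctx in doc_batches:
        opt.zero_grad()
        loss_fn(model(src, tgt, context=ctx), tgt).backward()
        opt.step()
```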


Multilingual Word Embeddings (MWEs) represent words from multiple languages in a single distributional vector space. Unsupervised MWE (UMWE) methods acquire multilingual embeddings without cross-lingual supervision, which is a significant advantage over traditional supervised approaches and opens many new possibilities for low-resource languages. Prior art for learning UMWEs, however, merely relies on a number of independently trained Unsupervised Bilingual Word Embeddings (UBWEs) to obtain multilingual embeddings. These methods fail to leverage the interdependencies that exist among many languages. To address this shortcoming, we propose a fully unsupervised framework for learning MWEs that directly exploits the relations between all language pairs. Our model substantially outperforms previous approaches in the experiments on multilingual word translation and cross-lingual word similarity. In addition, our model even beats supervised approaches trained with cross-lingual resources.


Machine translation systems achieve near human-level performance on some languages, yet their effectiveness strongly relies on the availability of large amounts of parallel sentences, which hinders their applicability to the majority of language pairs. This work investigates how to learn to translate when having access to only large monolingual corpora in each language. We propose two model variants, a neural and a phrase-based model. Both versions leverage a careful initialization of the parameters, the denoising effect of language models and automatic generation of parallel data by iterative back-translation. These models are significantly better than methods from the literature, while being simpler and having fewer hyper-parameters. On the widely used WMT'14 English-French and WMT'16 German-English benchmarks, our models respectively obtain 28.1 and 25.2 BLEU points without using a single parallel sentence, outperforming the state of the art by more than 11 BLEU points. On low-resource languages like English-Urdu and English-Romanian, our methods achieve even better results than semi-supervised and supervised approaches that leverage the small amount of available bitext. Our code for NMT and PBSMT is publicly available.
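A schematic of the iterative back-translation loop common to both model variants; `translate(model, sentences)` and `train(model, pairs)` are assumed callables supplied by whatever NMT toolkit is in use, so this is a conceptual sketch rather than the released code.

```python
def iterative_back_translation(model_s2t, model_t2s, mono_src, mono_tgt,
                               translate, train, rounds=3):
    """Each round, each direction translates monolingual text to create synthetic
    parallel data that trains the opposite direction."""
    for _ in range(rounds):
        # Back-translate target monolingual data -> synthetic sources for s2t training.
        synth_src = translate(model_t2s, mono_tgt)
        train(model_s2t, list(zip(synth_src, mono_tgt)))

        # Back-translate source monolingual data -> synthetic targets for t2s training.
        synth_tgt = translate(model_s2t, mono_src)
        train(model_t2s, list(zip(synth_tgt, mono_src)))
    return model_s2t, model_t2s
```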


Multi-source translation is an approach that exploits multiple inputs (e.g. in two different languages) to increase translation accuracy. In this paper, we examine approaches for multi-source neural machine translation (NMT) using an incomplete multilingual corpus in which some translations are missing. In practice, many multilingual corpora are not complete due to the difficulty of providing translations in all of the relevant languages (for example, in TED talks, most English talks only have subtitles for a small portion of the languages that TED supports). Existing studies on multi-source translation did not explicitly handle such situations. This study focuses on the use of incomplete multilingual corpora in multi-encoder NMT and mixture-of-NMT-experts models, and examines a very simple implementation in which missing source translations are replaced by a special symbol <NULL>. This allows us to use incomplete corpora both at training time and test time. In experiments with real incomplete multilingual corpora of TED Talks, multi-source NMT with the <NULL> tokens achieved higher translation accuracy, measured by BLEU, than any one-to-one NMT system.
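The data-side trick is simple enough to show directly. Below is a sketch of how one might prepare multi-source training examples with missing languages replaced by a <NULL> symbol; the record format and function name are illustrative assumptions.

```python
NULL_TOKEN = "<NULL>"

def build_multisource_examples(records, source_langs, target_lang):
    """records: list of dicts mapping language code -> sentence (possibly missing).

    Returns (sources, target) pairs where `sources` lists one sentence per source
    language, with <NULL> standing in for any missing translation."""
    examples = []
    for rec in records:
        if target_lang not in rec:
            continue  # no reference translation, nothing to train on
        sources = [rec.get(lang, NULL_TOKEN) for lang in source_langs]
        examples.append((sources, rec[target_lang]))
    return examples

# e.g. build_multisource_examples(talks, ["en", "fr"], "de") yields
# (["How are you?", "<NULL>"], "Wie geht es dir?") when the French line is missing.
```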


Using pre-trained word embeddings as the input layer is a common practice in many natural language processing (NLP) tasks, but it is largely neglected in neural machine translation (NMT). In this paper, we conducted a systematic analysis of the effect of using pre-trained source-side monolingual word embeddings in NMT. We compared several strategies, such as fixing or updating the embeddings during NMT training on varying amounts of data, and we also proposed a novel strategy called dual-embedding that blends the fixing and updating strategies. Our results suggest that pre-trained embeddings can be helpful if properly incorporated into NMT, especially when parallel data is limited or additional in-domain monolingual data is readily available.
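The abstract does not spell out how "dual-embedding" blends the fixed and updated strategies; one plausible realization, shown as a hedged sketch below, keeps a frozen copy of the pre-trained table alongside a trainable copy and sums their lookups (the summation and class interface are assumptions, not the paper's specification).

```python
import torch
import torch.nn as nn

class DualEmbedding(nn.Module):
    """Illustrative 'dual-embedding' layer: frozen pre-trained vectors plus a
    trainable copy, combined at lookup time."""

    def __init__(self, pretrained: torch.Tensor):
        super().__init__()
        # Fixed branch: keeps the pre-trained vectors unchanged during NMT training.
        self.fixed = nn.Embedding.from_pretrained(pretrained, freeze=True)
        # Updated branch: initialized from the same vectors but fine-tuned.
        self.tuned = nn.Embedding.from_pretrained(pretrained.clone(), freeze=False)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.fixed(token_ids) + self.tuned(token_ids)
```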


Despite impressive progress in high-resource settings, Neural Machine Translation (NMT) still struggles in low-resource and out-of-domain scenarios, often failing to match the quality of phrase-based translation. We propose a novel technique that combines back-translation and multilingual NMT to improve performance in these difficult cases. Our technique trains a single model for both directions of a language pair, allowing us to back-translate source or target monolingual data without requiring an auxiliary model. We then continue training on the augmented parallel data, enabling a cycle of improvement for a single model that can incorporate any source, target, or parallel data to improve both translation directions. As a byproduct, these models can reduce training and deployment costs significantly compared to uni-directional models. Extensive experiments show that our technique outperforms standard back-translation in low-resource scenarios, improves quality on cross-domain tasks, and effectively reduces costs across the board.
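A single model covering both directions of a language pair is commonly realized by prepending a target-language tag to every source sentence; the sketch below shows that data preparation under this assumption (the tag format is illustrative, not necessarily the paper's exact scheme).

```python
def make_bidirectional(parallel_pairs, lang_a="en", lang_b="fr"):
    """Turn (lang_a, lang_b) sentence pairs into training data for one model that
    handles both directions, by prepending a target-language token."""
    data = []
    for a, b in parallel_pairs:
        data.append((f"<2{lang_b}> {a}", b))  # lang_a -> lang_b direction
        data.append((f"<2{lang_a}> {b}", a))  # lang_b -> lang_a direction
    return data

# The same tagged model can then back-translate monolingual lang_b text by
# prefixing "<2en>", producing synthetic pairs for further training without
# any auxiliary reverse model.
```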


Machine translation has recently achieved impressive performance thanks to recent advances in deep learning and the availability of large-scale parallel corpora. There have been numerous attempts to extend these successes to low-resource language pairs, yet requiring tens of thousands of parallel sentences. In this work, we take this research direction to the extreme and investigate whether it is possible to learn to translate even without any parallel data. We propose a model that takes sentences from monolingual corpora in two different languages and maps them into the same latent space. By learning to reconstruct in both languages from this shared feature space, the model effectively learns to translate without using any labeled data. We demonstrate our model on two widely used datasets and two language pairs, reporting BLEU scores of 32.8 and 15.1 on the Multi30k and WMT English-French datasets, without using even a single parallel sentence at training time.
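Reconstruction in a shared latent space relies on denoising auto-encoding: a sentence is corrupted (word drops and local shuffles), encoded with the shared encoder, and decoded back into its own language. The corruption sketch below is illustrative; the drop probability and shuffle window are assumptions.

```python
import random

def add_noise(tokens, drop_prob=0.1, shuffle_window=3):
    """Corrupt a sentence for denoising auto-encoding: randomly drop words and
    slightly shuffle word order within a small window."""
    kept = [t for t in tokens if random.random() > drop_prob]
    # Local shuffle: each position may move only a few steps from where it was.
    keys = [i + random.uniform(0, shuffle_window) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept))]

# Denoising objective (schematically): for a sentence x in language L,
#   loss = cross_entropy(decoder_L(shared_encoder(add_noise(x))), x)
# so both languages learn to reconstruct from the same latent space.
```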


Homographs, words with different meanings but the same surface form, have long caused difficulty for machine translation systems, as it is difficult to select the correct translation based on the context. However, with the advent of neural machine translation (NMT) systems, which can theoretically take into account global sentential context, one may hypothesize that this problem has been alleviated. In this paper, we first provide empirical evidence that existing NMT systems in fact still have significant problems in properly translating ambiguous words. We then proceed to describe methods, inspired by the word sense disambiguation literature, that model the context of the input word with context-aware word embeddings that help to differentiate the word sense before feeding it into the encoder. Experiments on three language pairs demonstrate that such models improve the performance of NMT systems both in terms of BLEU score and in the accuracy of translating homographs.
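One simple way to realize a "context-aware" input embedding is to blend a word's vector with the mean of its neighbors' vectors before it enters the encoder; the mixing scheme and hyper-parameters below are illustrative assumptions rather than the paper's exact model.

```python
import numpy as np

def context_aware_embedding(sent_vecs, idx, window=2, alpha=0.5):
    """Blend the embedding of the (possibly ambiguous) word at position `idx`
    with the mean embedding of its surrounding words.

    sent_vecs: (sent_len, dim) array of token embeddings."""
    lo, hi = max(0, idx - window), min(len(sent_vecs), idx + window + 1)
    neighbors = [v for i, v in enumerate(sent_vecs[lo:hi], start=lo) if i != idx]
    context = np.mean(neighbors, axis=0) if neighbors else np.zeros_like(sent_vecs[idx])
    return alpha * sent_vecs[idx] + (1.0 - alpha) * context
```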


Monolingual data have been demonstrated to be helpful in improving the translation quality of both statistical machine translation (SMT) and neural machine translation (NMT) systems, especially in resource-poor or domain-adaptation tasks where parallel data are not rich enough. In this paper, we propose a novel approach to better leveraging monolingual data for neural machine translation by jointly learning source-to-target and target-to-source NMT models for a language pair with a joint EM optimization method. The training process starts with two initial NMT models pre-trained on parallel data, one for each direction, and these two models are iteratively updated by incrementally decreasing translation losses on the training data. In each iteration step, both NMT models are first used to translate monolingual data from one language to the other, forming pseudo-training data for the other NMT model. Then two new NMT models are learnt from the parallel data together with the pseudo-training data. Both NMT models are expected to improve, and better pseudo-training data can be generated in the next step. Experimental results on Chinese-English and English-German translation tasks show that our approach can simultaneously improve the translation quality of source-to-target and target-to-source models, significantly outperforming strong baseline systems that are enhanced with monolingual data for model training, including back-translation.


In spite of the recent success of neural machine translation (NMT) on standard benchmarks, the lack of large parallel corpora poses a major practical problem for many language pairs. There have been several proposals to alleviate this issue with, for instance, triangulation and semi-supervised learning techniques, but they still require a strong cross-lingual signal. In this work, we completely remove the need for parallel data and propose a novel method to train an NMT system in a completely unsupervised manner, relying on nothing but monolingual corpora. Our model builds upon the recent work on unsupervised embedding mappings, and consists of a slightly modified attentional encoder-decoder model that can be trained on monolingual corpora alone using a combination of denoising and back-translation. Despite the simplicity of the approach, our system obtains 15.56 and 10.21 BLEU points in WMT 2014 French-to-English and German-to-English translation, respectively. The model can also profit from small parallel corpora, attaining 21.81 and 15.24 points when combined with 100,000 parallel sentences. Our implementation is released as an open-source project.

Related papers
Jiacheng Zhang, Huanbo Luan, Maosong Sun, FeiFei Zhai, Jingfang Xu, Min Zhang, Yang Liu · Oct 8, 2018
Xilun Chen, Claire Cardie · Aug 27, 2018
Phrase-Based & Neural Unsupervised Machine Translation. Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, Marc'Aurelio Ranzato · Aug 13, 2018
Yuta Nishimura, Katsuhito Sudoh, Graham Neubig, Satoshi Nakamura · Jun 7, 2018
Xing Niu, Michael Denkowski, Marine Carpuat · May 29, 2018
Guillaume Lample, Alexis Conneau, Ludovic Denoyer, Marc'Aurelio Ranzato · Apr 13, 2018
Frederick Liu, Han Lu, Graham Neubig · Mar 28, 2018
Zhirui Zhang, Shujie Liu, Mu Li, Ming Zhou, Enhong Chen · Mar 1, 2018
Mikel Artetxe, Gorka Labaka, Eneko Agirre, Kyunghyun Cho · Feb 26, 2018