Multilingual Word Embeddings (MWEs) represent words from multiple languages in a single distributional vector space. Unsupervised MWE (UMWE) methods acquire multilingual embeddings without cross-lingual supervision, which is a significant advantage over traditional supervised approaches and opens many new possibilities for low-resource languages. Prior art for learning UMWEs, however, merely relies on a number of independently trained Unsupervised Bilingual Word Embeddings (UBWEs) to obtain multilingual embeddings. These methods fail to leverage the interdependencies that exist among many languages. To address this shortcoming, we propose a fully unsupervised framework for learning MWEs that directly exploits the relations between all language pairs. Our model substantially outperforms previous approaches in experiments on multilingual word translation and cross-lingual word similarity. In addition, our model even beats supervised approaches trained with cross-lingual resources.
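To make the architectural contrast concrete, below is a minimal sketch of the shared-space parameterization the abstract alludes to: one mapping per language into a single common space, rather than one independently trained mapping per language pair. The language codes, dimensions, random initialization, and the orthogonal-matrix parameterization are illustrative assumptions, and the unsupervised training objective that would actually learn these mappings is omitted; this is not the paper's procedure, only the structure it argues for.

```python
import numpy as np

# Hypothetical setup: pre-trained monolingual embeddings for three languages.
# Vocabulary size, dimensionality, and random vectors are placeholders.
rng = np.random.default_rng(0)
dim, vocab = 300, 5000
langs = ["en", "fr", "zh"]
emb = {l: rng.standard_normal((vocab, dim)) for l in langs}

def random_orthogonal(d, rng):
    # QR decomposition of a Gaussian matrix yields a random orthogonal matrix;
    # in a real system these mappings would be learned, not sampled.
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

# One mapping per language into a single shared space (not per language pair).
W = {l: random_orthogonal(dim, rng) for l in langs}

def to_shared(lang, vecs):
    return vecs @ W[lang]

def translate(src, tgt, vecs):
    # Map into the shared space, then back out with the inverse of the target
    # language's mapping (its transpose, since W[tgt] is orthogonal).
    return to_shared(src, vecs) @ W[tgt].T

fr_like = translate("en", "fr", emb["en"][:10])
print(fr_like.shape)  # (10, 300)
```

Under this parameterization, N languages share only N mappings across all N(N-1) translation directions, so evidence from any language pair constrains every mapping; composing independently trained UBWEs instead requires a separate mapping per pair and cannot share signal across pairs.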