Robust Cross-lingual Embeddings from Parallel Sentences

Recent advances in cross-lingual word embeddings have primarily relied on mapping-based methods, which project pretrained word embeddings from different languages into a shared space through a linear transformation. However, these approaches assume word embedding spaces are isomorphic between different languages, which has been shown not to hold in practice (Søgaard et al., 2018), and fundamentally limits their performance. This motivates investigating joint learning methods which can overcome this impediment, by simultaneously learning embeddings across languages via a cross-lingual term in the training objective. Given the abundance of parallel data available (Tiedemann, 2012), we propose a bilingual extension of the CBOW method which leverages sentence-aligned corpora to obtain robust cross-lingual word and sentence representations. Our approach significantly improves cross-lingual sentence retrieval performance over all other approaches, as well as convincingly outscores mapping methods while maintaining parity with jointly trained methods on word-translation. It also achieves parity with a deep RNN method on a zero-shot cross-lingual document classification task, requiring far fewer computational resources for training and inference. As an additional advantage, our bilingual method also improves the quality of monolingual word vectors despite training on much smaller datasets. We make our code and models publicly available.
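To make the mapping-based baseline mentioned in the abstract concrete, here is a minimal sketch of fitting a linear transformation between two embedding spaces. It is not the authors' bilingual CBOW method. It assumes toy 2-dimensional vectors and restricts the map to a rotation, for which the least-squares fit has a closed form; the names `rotate`, `fit_rotation`, `src_vecs`, and `tgt_vecs` are hypothetical and chosen for illustration only.

```python
import math

def rotate(vec, theta):
    """Apply a 2-D rotation by angle theta (radians) to a vector."""
    x, y = vec
    return (math.cos(theta) * x - math.sin(theta) * y,
            math.sin(theta) * x + math.cos(theta) * y)

def fit_rotation(src_vecs, tgt_vecs):
    """Return the angle minimizing sum ||rotate(src) - tgt||^2 over word pairs.

    Expanding the squared error shows it is minimized when
    a*cos(theta) + b*sin(theta) is maximal, i.e. theta = atan2(b, a).
    """
    a = sum(sx * tx + sy * ty for (sx, sy), (tx, ty) in zip(src_vecs, tgt_vecs))
    b = sum(sx * ty - sy * tx for (sx, sy), (tx, ty) in zip(src_vecs, tgt_vecs))
    return math.atan2(b, a)

# Toy "source language" vectors (illustrative values, not real embeddings).
src_vecs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (-0.5, 2.0)]

# Build a perfectly isomorphic "target language" space by rotating the source.
true_theta = 0.7
tgt_vecs = [rotate(v, true_theta) for v in src_vecs]

# Recover the mapping from the translation pairs and apply it.
theta = fit_rotation(src_vecs, tgt_vecs)
mapped = [rotate(v, theta) for v in src_vecs]
print(theta)
```

The fit is exact here only because the target space was constructed as a rotation of the source space. When the two spaces are not isomorphic, which is the situation the abstract describes, no single linear map aligns every pair, and that is the limitation joint training aims to avoid.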


Related content

Word vector representations


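Distributed representations encode language as dense, low-dimensional, continuous vectors. Researchers found early on that learned word embeddings exhibit analogy relations, for example apple − apples ≈ car − cars and man − woman ≈ king − queen. Word embedding methods can be trained directly on large-scale unlabeled corpora. The quality of the embeddings also depends heavily on the choice of context window size: a large window typically yields embeddings that reflect topical information, while a small window yields embeddings that reflect a word's function and contextual semantics.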
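The analogy relation can be illustrated with a small sketch. The 3-dimensional vectors below are hand-made toy values (assumptions for illustration, not trained embeddings), and `cosine` and `analogy` are hypothetical helper names.

```python
import math

# Hand-made toy vectors: dimensions loosely stand for (royalty, gender, fruit).
embeddings = {
    "king":  (1.0,  1.0, 0.0),
    "queen": (1.0, -1.0, 0.0),
    "man":   (0.0,  1.0, 0.0),
    "woman": (0.0, -1.0, 0.0),
    "apple": (0.0,  0.0, 1.0),
}

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def analogy(a, b, c, emb):
    """Answer 'a is to b as c is to ?' via the vector emb[b] - emb[a] + emb[c]."""
    query = tuple(vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c]))
    # Exclude the three input words, as is conventional for analogy queries.
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, emb[w]))

answer = analogy("man", "woman", "king", embeddings)
print(answer)
```

With these toy values, man − woman and king − queen are the same difference vector, which is the relation described above. Real embeddings satisfy it only approximately.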
