Understanding Linearity of Cross-Lingual Word Embedding Mappings

Cross-Lingual Word Embedding (CLWE) techniques play a fundamental role in tackling Natural Language Processing challenges for low-resource languages. The dominant approaches assume that the relationship between embeddings can be represented by a linear mapping, but the conditions under which this assumption holds have not been explored. This research gap has recently become critical, as evidence shows that relaxing mappings to be non-linear can lead to better performance in some cases. We present, for the first time, a theoretical analysis that identifies the preservation of analogies encoded in monolingual word embeddings as a necessary and sufficient condition for the ground-truth CLWE mapping between those embeddings to be linear. On a novel cross-lingual analogy dataset that covers five representative analogy categories for twelve distinct languages, we carry out experiments that provide direct empirical support for our theoretical claim. These results offer additional insight into the observations of other researchers and contribute inspiration for the development of more effective cross-lingual representation learning strategies.
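As a rough illustration of the link between linearity and analogy preservation, the sketch below checks one direction only: a linear map keeps equal embedding offsets equal, while a simple non-linear map need not. It is not the paper's analysis. The toy vectors, the matrix `M`, the element-wise `cube` map, and the reading of "analogy" as an exact offset equality `a - b == c - d` are all assumptions made for this example.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sub(u, v):
    """Element-wise difference u - v."""
    return [x - y for x, y in zip(u, v)]

def close(u, v, tol=1e-9):
    """True if u and v agree element-wise within tol."""
    return all(abs(x - y) <= tol for x, y in zip(u, v))

# Toy "source-language" embeddings (assumed), built so that a - b == c - d.
a, b = [2.0, 1.0], [1.0, 3.0]
c, d = [5.0, 0.0], [4.0, 2.0]
assert close(sub(a, b), sub(c, d))

# Hypothetical linear mapping into a "target-language" space.
M = [[0.5, -1.0],
     [2.0, 1.5]]

# Linear case: M a - M b equals M c - M d, because M(a - b) = M(c - d).
linear_ok = close(sub(matvec(M, a), matvec(M, b)),
                  sub(matvec(M, c), matvec(M, d)))

def cube(v):
    """A simple non-linear map (element-wise cube), chosen only as a contrast."""
    return [x ** 3 for x in v]

# Non-linear case: the two offsets no longer match for these vectors.
nonlinear_ok = close(sub(cube(a), cube(b)), sub(cube(c), cube(d)))

print(linear_ok, nonlinear_ok)
```

The linear case holds for any matrix because `M(a - b) = M(c - d)` whenever the offsets are equal. The converse direction, that analogy preservation forces linearity, is the part that requires the theoretical argument described in the abstract.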

Related content

Word vector representations

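Distributed representations encode language as dense, low-dimensional, continuous vectors. Researchers found early on that learned word embeddings exhibit analogy relations, for example apple − apples ≈ car − cars and man − woman ≈ king − queen. These embedding methods can all be trained directly on large-scale unlabeled corpora. The quality of word embeddings also depends heavily on the choice of context window size: larger context windows tend to yield embeddings that reflect topical information, whereas smaller context windows yield embeddings that reflect a word's function and contextual semantics.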
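The analogy relation can be made concrete with a minimal sketch of offset-based analogy completion. The three-dimensional vectors are invented for illustration and are not trained embeddings. Excluding the query words from the candidates is a common convention assumed here, and `complete_analogy` is a hypothetical helper name.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Toy embeddings (assumed values, chosen so that man - woman == king - queen).
emb = {
    "man":   [1.0, 0.0, 1.0],
    "woman": [1.0, 1.0, 1.0],
    "king":  [3.0, 0.0, 2.0],
    "queen": [3.0, 1.0, 2.0],
    "car":   [0.0, 0.0, -2.0],
}

def complete_analogy(a, b, c, emb):
    """Solve a : b :: c : ? by finding the word closest to b - a + c."""
    target = [y - x + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    # Exclude the query words themselves, as is commonly done.
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

answer = complete_analogy("man", "woman", "king", emb)
print(answer)
```

With real embeddings the offsets match only approximately, which is why the nearest neighbor by cosine similarity is used rather than an exact equality test.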
