Blind signal decomposition of various word embeddings based on join and individual variance explained

In recent years, natural language processing (NLP) has become one of the most important research areas, with applications throughout everyday life. Word embedding is its most fundamental task and still calls for more attention and research. Existing work on word embeddings focuses on proposing new embedding algorithms and on dimension reduction techniques for well-trained embeddings. In this paper, we propose using a joint signal separation method, JIVE, to decompose several trained word embeddings jointly into joint and individual components. This decomposition framework makes it easy to investigate the similarities and differences among word embeddings. We conducted an extensive empirical study on word2vec, FastText and GloVe embeddings trained on different corpora and with different dimensions. We compared the performance of the decomposed components on sentiment analysis using Twitter data and the Stanford Sentiment Treebank. We found that mapping different word embeddings into the joint component greatly improves sentiment performance for the original embeddings that performed worse. Moreover, we found that concatenating different components lets the same model achieve better performance. These findings provide insight into word embeddings, and our work offers a new way of generating word embeddings by fusing existing ones.
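The decomposition described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it is a simplified alternating scheme in the spirit of JIVE, run on synthetic matrices that stand in for real embeddings. The function name `jive_sketch`, the fixed ranks, and the iteration count are all assumptions made for illustration; rank selection and any rescaling of the blocks are left out.

```python
import numpy as np


def truncated_svd(matrix, rank):
    """Return the leading `rank` singular triplets of `matrix`."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :rank], s[:rank], vt[:rank]


def jive_sketch(blocks, joint_rank, indiv_ranks, n_iter=20):
    """Split each (n_words x dim_k) block into joint + individual parts.

    Simplified illustration only: ranks are supplied by the caller and
    the loop runs for a fixed number of iterations.
    """
    # Centre every feature column so the SVD captures variation, not means.
    blocks = [b - b.mean(axis=0, keepdims=True) for b in blocks]
    splits = np.cumsum([b.shape[1] for b in blocks])[:-1]
    indiv = [np.zeros_like(b) for b in blocks]

    for _ in range(n_iter):
        # Joint step: low-rank fit to all blocks side by side,
        # after removing the current individual estimates.
        stacked = np.hstack([b - a for b, a in zip(blocks, indiv)])
        u, s, vt = truncated_svd(stacked, joint_rank)
        joint = np.split((u * s) @ vt, splits, axis=1)

        # Individual step: low-rank fit to each block's residual,
        # projected away from the joint word-score subspace spanned by u.
        indiv = []
        for b, j, r in zip(blocks, joint, indiv_ranks):
            resid = b - j
            resid = resid - u @ (u.T @ resid)
            ui, si, vti = truncated_svd(resid, r)
            indiv.append((ui * si) @ vti)

    return joint, indiv, u


# Synthetic stand-ins for two embeddings of the same 200 words.
rng = np.random.default_rng(0)
shared = rng.normal(size=(200, 3))
emb_a = shared @ rng.normal(size=(3, 50)) + 0.1 * rng.normal(size=(200, 50))
emb_b = shared @ rng.normal(size=(3, 30)) + 0.1 * rng.normal(size=(200, 30))

joint, indiv, joint_scores = jive_sketch([emb_a, emb_b], 3, [2, 2])

# One way to "fuse" components: concatenate them per word.
fused_a = np.hstack([joint[0], indiv[0]])
```

Here `joint_scores` holds one low-dimensional vector per word that is shared by both inputs, while each entry of `indiv` keeps the structure specific to a single embedding. `fused_a` shows one possible reading of the concatenation mentioned in the abstract; the exact combinations the authors evaluated are not specified here.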


Related content

Word vector representation


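A distributed representation encodes language as dense, low-dimensional, continuous vectors. Researchers found early on that learned word embeddings exhibit analogy relations, for example apple − apples ≈ car − cars and man − woman ≈ king − queen. These methods can all be trained directly on large-scale unlabeled corpora. The quality of word embeddings also depends heavily on the choice of context window size. In general, embeddings learned with a large context window reflect topical information, while those learned with a small context window reflect a word's function and its contextual semantics.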
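The analogy relation above can be checked with simple vector arithmetic. The sketch below uses hand-picked three-dimensional toy vectors, which are hypothetical and not taken from any trained model, and a helper named `analogy` introduced only for illustration. It picks the word whose vector is closest in cosine similarity to b − a + c.

```python
import numpy as np

# Hypothetical toy vectors chosen by hand; real embeddings are learned
# from a corpus and have far more dimensions.
vectors = {
    "man": np.array([1.0, 0.0, 0.0]),
    "woman": np.array([1.0, 1.0, 0.0]),
    "king": np.array([1.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
    "car": np.array([0.0, 0.2, -1.0]),
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def analogy(a, b, c, table):
    """Solve a : b :: c : ? by nearest cosine neighbour of b - a + c."""
    target = table[b] - table[a] + table[c]
    # Exclude the three query words, as is common in analogy evaluation.
    candidates = [w for w in table if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(table[w], target))


result = analogy("man", "woman", "king", vectors)
print(result)  # queen
```

With these toy vectors the offset woman − man equals queen − king exactly, so the lookup returns "queen". Trained embeddings satisfy such relations only approximately.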

List of papers accepted at EMNLP 2020, the top NLP conference: all 754 papers are here!

Zhuanzhi Member Service

28+ reads · October 26, 2020

[Microsoft Research Asia] Geometry-aware Domain Adaptation for Unsupervised Alignment of Word Embeddings

Zhuanzhi Member Service

23+ reads · April 21, 2020

word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings of Structured Data

Zhuanzhi Member Service

52+ reads · April 1, 2020

[EMNLP 2019 Best Paper] Specializing Word Embeddings (for Parsing) by Information Bottleneck

Zhuanzhi Member Service

24+ reads · November 20, 2019