The notion of word embedding plays a fundamental role in natural language processing (NLP). However, pre-training word embeddings for a very large-scale vocabulary is computationally challenging for most existing methods. In this work, we show that with merely a small fraction of contexts (Q-contexts) that are typical of the whole corpus (and their mutual information with words), one can construct high-quality word embeddings with negligible errors. Mutual information between contexts and words can be encoded canonically as a sampling state; thus, Q-contexts can be constructed quickly. Furthermore, we present an efficient and effective WEQ method, which is capable of extracting word embeddings directly from these typical contexts. In practical scenarios, our algorithm runs 11$\sim$13 times faster than well-established methods. By comparing with well-known methods such as matrix factorization, word2vec, GloVe, and fastText, we demonstrate that our method achieves comparable performance on a variety of downstream NLP tasks, while maintaining run-time and resource advantages over all these baselines.
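To make the core idea concrete, the following is a minimal, hypothetical sketch: rather than factorizing the full word-context mutual-information matrix, keep only a small set of sampled "typical" context columns and derive embeddings from that thin submatrix. The function names (`build_pmi`, `sample_q_contexts`, `embed_from_q_contexts`), the PPMI formulation, the energy-based sampling rule, and the SVD step are all illustrative assumptions, not the paper's actual WEQ algorithm or its sampling-state construction.

```python
# Hypothetical illustration of embedding from a small set of sampled contexts.
# Not the paper's WEQ method; sampling and factorization details are assumed.
import numpy as np

def build_pmi(cooc: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Positive PMI from a word-by-context co-occurrence count matrix."""
    total = cooc.sum()
    p_w = cooc.sum(axis=1, keepdims=True) / total   # word marginals
    p_c = cooc.sum(axis=0, keepdims=True) / total   # context marginals
    p_wc = cooc / total
    pmi = np.log((p_wc + eps) / (p_w * p_c + eps))
    return np.maximum(pmi, 0.0)

def sample_q_contexts(pmi: np.ndarray, q: int, seed: int = 0) -> np.ndarray:
    """Sample q context columns with probability proportional to their squared
    norm (a simple 'typicality' proxy assumed here for illustration)."""
    rng = np.random.default_rng(seed)
    weights = (pmi ** 2).sum(axis=0)
    probs = weights / weights.sum()
    return rng.choice(pmi.shape[1], size=q, replace=False, p=probs)

def embed_from_q_contexts(pmi: np.ndarray, q_idx: np.ndarray, dim: int) -> np.ndarray:
    """Factorize only the n_words x q submatrix to obtain d-dim word vectors."""
    sub = pmi[:, q_idx]                      # thin matrix: far cheaper to factorize
    u, s, _ = np.linalg.svd(sub, full_matrices=False)
    return u[:, :dim] * np.sqrt(s[:dim])

# Toy usage: 1000 words, 1000 contexts, but only 64 sampled Q-contexts.
rng = np.random.default_rng(0)
cooc = rng.poisson(0.3, size=(1000, 1000)).astype(float)
pmi = build_pmi(cooc)
q_idx = sample_q_contexts(pmi, q=64)
vectors = embed_from_q_contexts(pmi, q_idx, dim=32)
print(vectors.shape)  # (1000, 32)
```

The speed-up claimed in the abstract comes from exactly this kind of reduction: factorizing an $n \times q$ submatrix with $q \ll n$ is far cheaper than factorizing the full $n \times n$ word-context matrix.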