The notion of word embedding plays a fundamental role in natural language processing (NLP). However, pre-training word embeddings for a very large-scale vocabulary is computationally challenging for most existing methods. In this work, we show that with merely a small fraction of contexts (Q-contexts) that are typical of the whole corpus (and their mutual information with words), one can construct high-quality word embeddings with negligible errors. Mutual information between contexts and words can be encoded canonically as a sampling state; thus, Q-contexts can be constructed quickly. Furthermore, we present an efficient and effective WEQ method, which is capable of extracting word embeddings directly from these typical contexts. In practical scenarios, our algorithm runs 11$\sim$13 times faster than well-established methods. By comparing with well-known methods such as matrix factorization, word2vec, GloVe, and fastText, we demonstrate that our method achieves comparable performance on a variety of downstream NLP tasks, while maintaining run-time and resource advantages over all these baselines.
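To make the core idea concrete, the following is a minimal, hypothetical sketch: rather than factorizing the full word-context mutual-information matrix, keep only a small set of sampled "typical" context columns and derive embeddings from that thin submatrix. The function names (`build_pmi`, `sample_q_contexts`, `embed_from_q_contexts`), the PPMI formulation, the energy-based sampling rule, and the SVD step are all illustrative assumptions, not the paper's actual WEQ algorithm or its sampling-state construction.

```python
# Hypothetical illustration of embedding from a small set of sampled contexts.
# Not the paper's WEQ method; sampling and factorization details are assumed.
import numpy as np

def build_pmi(cooc: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Positive PMI from a word-by-context co-occurrence count matrix."""
    total = cooc.sum()
    p_w = cooc.sum(axis=1, keepdims=True) / total   # word marginals
    p_c = cooc.sum(axis=0, keepdims=True) / total   # context marginals
    p_wc = cooc / total
    pmi = np.log((p_wc + eps) / (p_w * p_c + eps))
    return np.maximum(pmi, 0.0)

def sample_q_contexts(pmi: np.ndarray, q: int, seed: int = 0) -> np.ndarray:
    """Sample q context columns with probability proportional to their squared
    norm (a simple 'typicality' proxy assumed here for illustration)."""
    rng = np.random.default_rng(seed)
    weights = (pmi ** 2).sum(axis=0)
    probs = weights / weights.sum()
    return rng.choice(pmi.shape[1], size=q, replace=False, p=probs)

def embed_from_q_contexts(pmi: np.ndarray, q_idx: np.ndarray, dim: int) -> np.ndarray:
    """Factorize only the n_words x q submatrix to obtain d-dim word vectors."""
    sub = pmi[:, q_idx]                      # thin matrix: far cheaper to factorize
    u, s, _ = np.linalg.svd(sub, full_matrices=False)
    return u[:, :dim] * np.sqrt(s[:dim])

# Toy usage: 1000 words, 1000 contexts, but only 64 sampled Q-contexts.
rng = np.random.default_rng(0)
cooc = rng.poisson(0.3, size=(1000, 1000)).astype(float)
pmi = build_pmi(cooc)
q_idx = sample_q_contexts(pmi, q=64)
vectors = embed_from_q_contexts(pmi, q_idx, dim=32)
print(vectors.shape)  # (1000, 32)
```

The speed-up claimed in the abstract comes from exactly this kind of reduction: factorizing an $n \times q$ submatrix with $q \ll n$ is far cheaper than factorizing the full $n \times n$ word-context matrix.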