Word embeddings have become ubiquitous in text mining and natural language processing (NLP) tasks such as information retrieval, semantic analysis, and machine translation. Unfortunately, training word embeddings on a relatively large corpus is prohibitively expensive. We propose a graph-based word embedding algorithm, called Word-Graph2vec, which converts the large corpus into a word co-occurrence graph, samples word sequences from this graph by random walks, and finally trains the word embedding on the sampled corpus. We posit that, because English has a stable vocabulary, idioms, and fixed expressions, the size and density of the word co-occurrence graph change only slightly as the training corpus grows. As a result, Word-Graph2vec has a stable runtime on large-scale datasets, and its performance advantage becomes more pronounced as the training corpus grows. Extensive experiments on real-world datasets show that the proposed algorithm is four to five times more efficient than traditional Skip-Gram, while the error introduced by random walk sampling is small.
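The first two stages of the pipeline described above (building a word co-occurrence graph, then sampling word sequences from it by random walks) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the forward-looking context window, the directed edges, and the choice to weight transitions by raw co-occurrence counts are all assumptions made for the example.

```python
import random
from collections import defaultdict


def build_cooccurrence_graph(sentences, window=2):
    """Build a directed graph {word: {following_word: count}}.

    Assumption: an edge links a word to each of the next `window` tokens
    in the same sentence, weighted by how often the pair occurs.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                graph[word][tokens[j]] += 1
    return {word: dict(nbrs) for word, nbrs in graph.items()}


def random_walk(graph, start, length, rng):
    """Walk up to `length` words, choosing each step with probability
    proportional to the edge weight (an assumed transition rule)."""
    walk = [start]
    while len(walk) < length:
        nbrs = graph.get(walk[-1])
        if not nbrs:  # dead end: the current word has no outgoing edges
            break
        words = list(nbrs)
        weights = [nbrs[w] for w in words]
        walk.append(rng.choices(words, weights=weights, k=1)[0])
    return walk


def sample_corpus(graph, walks_per_node, length, seed=0):
    """Start `walks_per_node` walks from every node that has outgoing edges."""
    rng = random.Random(seed)
    return [
        random_walk(graph, node, length, rng)
        for node in graph
        for _ in range(walks_per_node)
    ]
```

The sampled walks are plain lists of tokens, so they could be passed in place of the original sentences to any Skip-Gram trainer that accepts tokenized input. In this sketch the sampling cost depends on the number of nodes, walks per node, and walk length rather than on the raw corpus size, which is consistent with the stable-runtime argument in the abstract.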