Intrinsic analysis for dual word embedding space models

Recent word embedding techniques represent words in a continuous vector space, moving away from the atomic and sparse representations of the past. Each technique can produce multiple varieties of embeddings depending on hyper-parameter settings such as embedding dimension, context window size, and training method. A further variety arises with dual embedding space techniques, which generate not one but two sets of word embeddings as output. This raises an interesting question: "Is there one of the two word embedding varieties, or a combination of them, that works better for a specific task?" This paper tries to answer that question by considering all of these variations. We compare two classical embedding methods from two different methodologies: Word2Vec (window-based) and GloVe (count-based). For an extensive evaluation covering all variations, a total of 84 different models were compared on semantic, association, and analogy evaluation tasks, which together comprise 9 open-source linguistic datasets. For Word2Vec, the final results show a non-default model being preferred in 2 out of 3 tasks. For GloVe, non-default models outperform the default in all 3 evaluation tasks.
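To make the idea of "one or a combination of the two embeddings" concrete, the following is a minimal sketch, not the paper's actual pipeline. It assumes a dual embedding model exposes two vectors per token (called `word_vecs` and `context_vecs` here, both hypothetical names) and builds four illustrative variants: either vector alone, their element-wise sum, and their concatenation. The toy vectors and the set of combination modes are assumptions chosen for illustration; the abstract does not list which combinations were evaluated.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def combine(word_vecs, context_vecs, mode):
    """Build one embedding variant from the two embedding spaces.

    mode is one of "word", "context", "sum", "concat" (illustrative names).
    """
    combined = {}
    for token, w in word_vecs.items():
        c = context_vecs[token]
        if mode == "word":
            combined[token] = list(w)
        elif mode == "context":
            combined[token] = list(c)
        elif mode == "sum":
            combined[token] = [a + b for a, b in zip(w, c)]
        elif mode == "concat":
            combined[token] = list(w) + list(c)
        else:
            raise ValueError(f"unknown mode: {mode}")
    return combined


# Toy 2-dimensional vectors, invented purely for illustration.
word_vecs = {"cat": [1.0, 0.0], "dog": [0.8, 0.6]}
context_vecs = {"cat": [0.0, 1.0], "dog": [0.6, 0.8]}

for mode in ("word", "context", "sum", "concat"):
    emb = combine(word_vecs, context_vecs, mode)
    print(mode, round(cosine(emb["cat"], emb["dog"]), 3))
```

An intrinsic evaluation would then score each variant on the same similarity or analogy dataset and compare the results, which is the kind of comparison the abstract describes.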


Related content

Word vector representation


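A distributed representation encodes language as dense, low-dimensional, continuous vectors. Researchers observed early on that learned word embeddings exhibit analogy relations, for example apple − apples ≈ car − cars and man − woman ≈ king − queen. Such embedding methods can be trained directly on large-scale unlabeled corpora. The quality of word embeddings also depends heavily on the choice of context window size. Typically, embeddings learned with a large context window reflect more topical information, while embeddings learned with a small context window reflect more of a word's function and contextual semantic information.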
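The analogy relation above can be sketched with vector arithmetic: to complete "man is to woman as king is to ?", compute woman − man + king and pick the nearest remaining word by cosine similarity. The example below is a minimal sketch using hand-made 3-dimensional toy vectors (an assumption for illustration, not trained embeddings); `solve_analogy` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(vectors, a, b, c):
    """Return the word d that best completes 'a is to b as c is to d'.

    The target point is b - a + c; the three query words are excluded
    from the candidates, as is common in analogy evaluations.
    """
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Toy vectors invented for illustration; dimensions loosely mean
# (royalty, male, female). Real embeddings are learned, not hand-set.
toy_vectors = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "car": [0.2, 0.1, 0.1],
}

print(solve_analogy(toy_vectors, "man", "woman", "king"))
```

With trained embeddings the relation holds only approximately, which is why analogy benchmarks report accuracy over many such questions rather than exact equality.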
