Evaluating Word Embeddings with Categorical Modularity

We introduce categorical modularity, a novel low-resource intrinsic metric to evaluate word embedding quality. Categorical modularity is a graph modularity metric based on the $k$-nearest neighbor graph constructed with embedding vectors of words from a fixed set of semantic categories, in which the goal is to measure the proportion of words that have nearest neighbors within the same categories. We use a core set of 500 words belonging to 59 neurobiologically motivated semantic categories in 29 languages and analyze three word embedding models per language (FastText, MUSE, and subs2vec). We find moderate to strong positive correlations between categorical modularity and performance on the monolingual tasks of sentiment analysis and word similarity calculation and on the cross-lingual task of bilingual lexicon induction both to and from English. Overall, we suggest that categorical modularity provides non-trivial predictive information about downstream task performance, with breakdowns of correlations by model suggesting some meta-predictive properties about semantic information loss as well.
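The construction described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes cosine similarity, an unweighted and symmetrized $k$-nearest-neighbor graph, and the standard Newman modularity formula, any of which may differ from the paper's exact definition. The function names `knn_adjacency` and `modularity` and the toy 2-D vectors are hypothetical.

```python
import numpy as np

def knn_adjacency(X, k):
    """Unweighted, symmetrized k-nearest-neighbor graph under cosine similarity.
    (Assumption: the paper's graph may be weighted or built differently.)"""
    X = np.asarray(X, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = X @ X.T
    np.fill_diagonal(sim, -np.inf)           # a word is never its own neighbor
    nbrs = np.argsort(-sim, axis=1)[:, :k]   # indices of the k most similar words
    A = np.zeros((len(X), len(X)))
    rows = np.repeat(np.arange(len(X)), k)
    A[rows, nbrs.ravel()] = 1.0
    return np.maximum(A, A.T)                # keep an edge if either endpoint chose it

def modularity(A, labels):
    """Newman modularity of the partition induced by the category labels."""
    labels = np.asarray(labels)
    deg = A.sum(axis=1)
    two_m = deg.sum()                        # twice the number of edges
    same = labels[:, None] == labels[None, :]
    return float(((A - np.outer(deg, deg) / two_m) * same).sum() / two_m)

# Toy "embeddings": two tight directional clusters standing in for two categories.
X = [[1, 0.00], [1, 0.05], [1, 0.10], [1, 0.15],
     [0.00, 1], [0.05, 1], [0.10, 1], [0.15, 1]]
A = knn_adjacency(X, k=2)
q_true = modularity(A, [0, 0, 0, 0, 1, 1, 1, 1])   # labels match the clusters
q_mixed = modularity(A, [0, 1, 0, 1, 0, 1, 0, 1])  # labels cut across the clusters
```

When the category labels line up with the neighborhood structure, `q_true` is higher than `q_mixed`, which is the intuition behind using modularity as a quality signal: embeddings that place same-category words near each other score higher.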


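Related content

Word vector representations

A distributed representation encodes language as dense, low-dimensional, continuous vectors. Researchers noticed early on that learned word embeddings exhibit analogy relations, for example apple − apples ≈ car − cars and man − woman ≈ king − queen. Word embedding methods can be trained directly on large unlabeled corpora. Embedding quality also depends heavily on the choice of context window size: larger windows generally yield embeddings that reflect topical information, while smaller windows yield embeddings that reflect a word's function and contextual semantics.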
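The analogy relation mentioned above can be illustrated with a small sketch. The vectors here are not learned embeddings: they are built by hand from hypothetical `gender` and `royalty` offsets so that the relation holds exactly, whereas in trained embeddings it holds only approximately.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical latent directions; real embeddings learn these implicitly.
base, gender, royalty = rng.normal(size=(3, 8))

vec = {
    "man": base,
    "woman": base + gender,
    "king": base + royalty,
    "queen": base + royalty + gender,
}

# man - woman and king - queen both reduce to the same (negated) gender offset.
lhs = vec["man"] - vec["woman"]
rhs = vec["king"] - vec["queen"]
analogy_holds = np.allclose(lhs, rhs)
```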
