We introduce categorical modularity, a novel low-resource intrinsic metric for evaluating word embedding quality. Categorical modularity is a graph modularity metric based on the $k$-nearest-neighbor graph constructed from the embedding vectors of words drawn from a fixed set of semantic categories; it measures the extent to which words have nearest neighbors within their own categories. We use a core set of 500 words belonging to 59 neurobiologically motivated semantic categories in 29 languages and analyze three word embedding models per language (FastText, MUSE, and subs2vec). We find moderate to strong positive correlations between categorical modularity and performance on the monolingual tasks of sentiment analysis and word similarity calculation, as well as on the cross-lingual task of bilingual lexicon induction both to and from English. Overall, we suggest that categorical modularity provides non-trivial predictive information about downstream task performance; breakdowns of the correlations by model further suggest that it carries meta-predictive information about semantic information loss.
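The metric described above can be sketched in a few lines: build a $k$-nearest-neighbor graph over the embedding vectors, then compute standard graph modularity with the semantic categories playing the role of communities. The sketch below is a minimal illustration only, assuming cosine similarity and an undirected, symmetrized $k$-NN graph; the paper's exact graph construction and modularity variant may differ.

```python
import numpy as np

def knn_adjacency(vectors, k):
    """Undirected adjacency matrix of the k-nearest-neighbor graph
    under cosine similarity (symmetrized: an edge exists if either
    endpoint lists the other among its k nearest neighbors)."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a word is not its own neighbor
    n = len(vectors)
    A = np.zeros((n, n))
    for i in range(n):
        A[i, np.argsort(sim[i])[-k:]] = 1.0  # top-k most similar words
    return np.maximum(A, A.T)

def modularity(A, labels):
    """Newman modularity Q of the partition given by `labels`:
    fraction of edges within categories minus the fraction expected
    by chance given the degree sequence."""
    m = A.sum() / 2.0           # number of (undirected) edges
    deg = A.sum(axis=1)
    Q = 0.0
    n = len(labels)
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                Q += A[i, j] - deg[i] * deg[j] / (2.0 * m)
    return Q / (2.0 * m)
```

On embeddings where each semantic category forms a tight cluster, `modularity` approaches its maximum for that partition; values near zero indicate that category membership is unrelated to neighborhood structure.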