Static word embeddings encode word associations that are widely exploited in downstream NLP tasks. Although prior studies have examined the nature of these associations in terms of the biases and lexical regularities they capture, how the associations vary with the embedding training procedure remains poorly understood. This work addresses that gap by assessing attributive word associations across five static word embedding architectures, analyzing the impact of the model architecture, the context learning flavor, and the training corpora. Our approach applies semi-supervised clustering to annotated proper nouns and adjectives, using their embedding features, to reveal the attributive word associations formed in the embedding space without introducing confirmation bias. Our results show that the choice of context learning flavor during training (CBOW vs. skip-gram) affects how distinguishable the word associations are and how sensitive the embeddings are to deviations in the training corpora. Moreover, we show empirically that, even when trained on the same corpora, different word embedding models exhibit significant inter-model disparity and intra-model similarity in the encoded word associations, indicating that each embedding architecture shapes its embedding space in characteristic ways.
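To make the clustering step concrete, the sketch below illustrates one way a semi-supervised (seed-initialized) clustering of word vectors could be set up. It is a minimal example, not the paper's exact pipeline: the file name `embeddings.bin`, the attribute seed lists, and the target words are hypothetical, and the seeded KMeans is an illustrative stand-in for whichever semi-supervised algorithm is actually used.

```python
# Minimal sketch: semi-supervised clustering of word vectors,
# with clusters seeded by a few attribute-annotated words.
# Assumes gensim and scikit-learn are installed; all names are illustrative.
import numpy as np
from gensim.models import KeyedVectors
from sklearn.cluster import KMeans

# Hypothetical pretrained static embeddings (word2vec binary format).
vectors = KeyedVectors.load_word2vec_format("embeddings.bin", binary=True)

# Hypothetical attribute categories with a few annotated seed words each.
seed_words = {
    "career": ["engineer", "salary", "office"],
    "family": ["home", "parents", "wedding"],
}

# Initialize one centroid per attribute from the mean of its seed vectors.
init_centroids = np.stack([
    np.mean([vectors[w] for w in words], axis=0)
    for words in seed_words.values()
])

# Annotated proper nouns and adjectives to assign to attribute clusters.
target_words = ["John", "Mary", "ambitious", "gentle"]
present = [w for w in target_words if w in vectors]
X = np.stack([vectors[w] for w in present])

# Seeded KMeans: each word is assigned to the attribute whose centroid
# it lies closest to in the embedding space.
km = KMeans(n_clusters=len(seed_words), init=init_centroids, n_init=1)
labels = km.fit_predict(X)

attribute_names = list(seed_words)
for word, label in zip(present, labels):
    print(word, "->", attribute_names[label])
```

Because the cluster assignments are driven entirely by the geometry of the embedding space rather than by hand-picked attribute pairings, this style of analysis avoids imposing the analyst's expectations on which associations should appear.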