In data-dominated systems and applications, representing words in a numerical format has attracted considerable attention. Several approaches exist for generating such representations. An important question is how well these representations, called embeddings, imitate human judgments of semantic similarity between words. In this study, we perform a fuzzy-based analysis of vector representations of words, i.e., word embeddings. We apply two popular fuzzy clustering algorithms to count-based GloVe word embeddings of different dimensionality. Words from WordSim-353, referred to as the gold standard, are represented as vectors and clustered. The results indicate that fuzzy clustering algorithms are very sensitive to high-dimensional data and that parameter tuning can dramatically change their performance. We show that, by adjusting the value of the fuzzifier parameter, fuzzy clustering can be successfully applied to vectors of high dimensionality, up to one hundred dimensions. Additionally, we illustrate that fuzzy clustering provides interesting results regarding the membership of words in different clusters.
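To make the role of the fuzzifier concrete, the following is a minimal sketch of standard fuzzy c-means, assuming it is representative of the kind of fuzzy clustering algorithm referred to above (the abstract does not name the two algorithms used). The names `fuzzy_c_means`, `make_blobs`, and `mean_max_membership` are hypothetical, and the synthetic Gaussian vectors are only a stand-in for GloVe vectors of WordSim-353 words, not the data used in the study.

```python
import math
import random


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Standard fuzzy c-means on a list of equal-length vectors.

    m > 1 is the fuzzifier: values close to 1 give nearly crisp
    memberships, larger values give softer ones.
    """
    if m <= 1.0:
        raise ValueError("fuzzifier m must be greater than 1")
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships; each row is normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([v / total for v in row])
    centers = [[0.0] * dim for _ in range(n_clusters)]
    for _ in range(n_iter):
        # Centers are means of the points weighted by u_ik ** m.
        for k in range(n_clusters):
            w = [u[i][k] ** m for i in range(n)]
            total = max(sum(w), 1e-300)  # guard against an empty cluster
            centers[k] = [
                sum(w[i] * points[i][d] for i in range(n)) / total
                for d in range(dim)
            ]
        # Memberships: u_ik proportional to d_ik ** (-2 / (m - 1)),
        # computed in the log domain so small m does not overflow.
        shift = 0.0
        for i, p in enumerate(points):
            logits = [
                -(2.0 / (m - 1.0)) * math.log(max(math.dist(p, c), 1e-12))
                for c in centers
            ]
            top = max(logits)
            weights = [math.exp(v - top) for v in logits]
            total = sum(weights)
            new_row = [v / total for v in weights]
            shift = max(shift, max(abs(a - b) for a, b in zip(new_row, u[i])))
            u[i] = new_row
        if shift < tol:
            break
    return centers, u


def make_blobs(n_per_blob, dim, offset=2.0, seed=0):
    """Two synthetic Gaussian blobs (a stand-in for word vectors)."""
    rng = random.Random(seed)
    return [
        [sign * offset + rng.gauss(0.0, 1.0) for _ in range(dim)]
        for sign in (-1.0, 1.0)
        for _ in range(n_per_blob)
    ]


def mean_max_membership(u):
    """Average of each point's largest membership (1.0 means crisp)."""
    return sum(max(row) for row in u) / len(u)


if __name__ == "__main__":
    data = make_blobs(n_per_blob=20, dim=50)
    for fuzzifier in (2.0, 1.1):
        _, memberships = fuzzy_c_means(data, n_clusters=2, m=fuzzifier)
        print(fuzzifier, round(mean_max_membership(memberships), 3))
```

Running the script prints the mean largest membership for two fuzzifier values, which gives a rough sense of how strongly this single parameter shapes the partition. Each row of the returned membership matrix sums to 1, so a word can belong partially to several clusters, which is the property the membership analysis relies on.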