We present SemSimp, a parametric method for measuring the semantic similarity of digital resources. SemSimp is based on the notion of information content and leverages a reference ontology and taxonomic reasoning, supporting different approaches for weighting the concepts of the ontology. In particular, weights can be computed from either the available digital resources or the structure of the reference ontology of a given domain. SemSimp is assessed against six representative methods from the literature for comparing sets of concepts, through an experiment that includes both a statistical analysis and an expert judgement evaluation. To obtain a reliable assessment, we used a large real-world dataset based on the Digital Library of the Association for Computing Machinery (ACM) and a reference ontology derived from the ACM Computing Classification System (ACM-CCS). For each method, we considered two indicators: the first is the degree of confidence in identifying the similarity among papers belonging to selected special issues of the ACM Transactions on Information Systems journal; the second is the Pearson correlation with human judgement. The results reveal that one of the configurations of SemSimp outperforms the other assessed methods. An additional experiment performed in the domain of physics shows that, in general, SemSimp provides better results than the other similarity methods.
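To make the notion of information content more concrete, the following minimal Python sketch derives frequency-based concept weights from annotation counts over a toy taxonomy and compares two concepts. It is an illustration under stated assumptions, not the SemSimp definition: the taxonomy, the counts, and all function names are hypothetical, and Lin's measure is used only as a familiar information-content-based similarity, since the abstract does not state which formula SemSimp adopts.

```python
import math

# Hypothetical toy taxonomy (child -> parent); not taken from ACM-CCS.
PARENT = {
    "information_systems": "root",
    "information_retrieval": "information_systems",
    "data_mining": "information_systems",
    "theory": "root",
}

# Hypothetical counts: number of resources annotated with each concept.
COUNTS = {"information_retrieval": 6, "data_mining": 3, "theory": 1}


def ancestors(concept):
    """Return the concept followed by its ancestors, ending at the root."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def frequency_weights(counts):
    """Weight of a concept: share of annotations on it or on any descendant."""
    total = sum(counts.values())
    cumulative = {}
    for concept, n in counts.items():
        for node in ancestors(concept):
            cumulative[node] = cumulative.get(node, 0) + n
    return {node: n / total for node, n in cumulative.items()}


def information_content(weights, concept):
    """IC(c) = -log w(c): rarer (more specific) concepts carry more information."""
    return -math.log(weights[concept])


def lin_similarity(weights, a, b):
    """Lin's measure: 2 * IC(lcs) / (IC(a) + IC(b)), lcs = least common subsumer."""
    ancestors_of_b = set(ancestors(b))
    lcs = next(node for node in ancestors(a) if node in ancestors_of_b)
    denom = information_content(weights, a) + information_content(weights, b)
    if denom == 0:  # both concepts are the root
        return 1.0
    return 2 * information_content(weights, lcs) / denom


weights = frequency_weights(COUNTS)
sibling_sim = lin_similarity(weights, "information_retrieval", "data_mining")
unrelated_sim = lin_similarity(weights, "information_retrieval", "theory")
```

In this sketch, two concepts that share only the root get similarity 0, while siblings under a more specific parent get a positive score. A structure-based weighting, the other option mentioned above, would replace `frequency_weights` with a function of the taxonomy alone.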