Author name disambiguation results are often evaluated with measures such as Cluster-F, K-metric, Pairwise-F, Splitting & Lumping Error, and B-cubed. Although these measures have distinct evaluation schemes, this paper shows that they can be calculated in a single framework through a set of common steps that compare truth and predicted clusters using two hash tables, which record name instances with their predicted cluster indices and the frequencies of those indices per truth cluster. This integrative calculation greatly reduces runtime and scales to clustering tasks involving millions of name instances, which it completes within a few seconds. During the integration process, B-cubed and K-metric are shown to produce the same precision and recall scores. Within this framework, name instance pairs for Pairwise-F are counted using a heuristic that is faster than a state-of-the-art algorithm. Details of the integrative calculation are described with examples and pseudo-code to help scholars implement each measure easily and validate the correctness of their implementations. The integrative calculation will help scholars compare the similarities and differences of multiple measures before selecting those that best characterize the clustering performance of their disambiguation methods.
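The shared steps described above can be illustrated with a minimal Python sketch. It is not the paper's pseudo-code: the function name `cluster_measures`, the input layout (two dicts mapping each name instance to a cluster id), and the output keys are assumptions made for illustration. Pairs are counted here with the standard "n choose 2" sums over the frequency table, which may differ from the heuristic the paper uses, and Cluster-F is taken as the share of clusters matched exactly. Splitting & Lumping Error is left out because its definition is not given in this abstract.

```python
from collections import Counter
from math import comb, sqrt


def f1(p, r):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    return 2 * p * r / (p + r) if p + r else 0.0


def cluster_measures(truth, predicted):
    """truth, predicted: dicts mapping each name instance to a cluster id
    (hypothetical input layout chosen for this sketch)."""
    # Table 1 is `predicted` itself: instance -> predicted cluster index.
    # Table 2: for each truth cluster, the frequency of predicted indices.
    freq = {}
    for inst, t in truth.items():
        freq.setdefault(t, Counter())[predicted[inst]] += 1

    truth_size = {t: sum(c.values()) for t, c in freq.items()}
    pred_size = Counter(predicted[inst] for inst in truth)
    n = len(truth)

    sq_over_pred = sq_over_truth = 0.0
    shared_pairs = exact = 0
    for t, counts in freq.items():
        for p, k in counts.items():
            sq_over_pred += k * k / pred_size[p]    # B-cubed precision term
            sq_over_truth += k * k / truth_size[t]  # B-cubed recall term
            shared_pairs += comb(k, 2)              # pairs correct in both
            if k == truth_size[t] == pred_size[p]:  # clusters are identical
                exact += 1

    b3_p, b3_r = sq_over_pred / n, sq_over_truth / n
    pred_pairs = sum(comb(s, 2) for s in pred_size.values())
    truth_pairs = sum(comb(s, 2) for s in truth_size.values())
    pw_p = shared_pairs / pred_pairs if pred_pairs else 0.0
    pw_r = shared_pairs / truth_pairs if truth_pairs else 0.0
    cl_p, cl_r = exact / len(pred_size), exact / len(truth_size)

    return {
        "bcubed_precision": b3_p, "bcubed_recall": b3_r,
        "bcubed_f": f1(b3_p, b3_r),
        # K-metric: geometric mean of the same two sums used by B-cubed.
        "k_metric": sqrt(b3_p * b3_r),
        "pairwise_precision": pw_p, "pairwise_recall": pw_r,
        "pairwise_f": f1(pw_p, pw_r),
        "cluster_precision": cl_p, "cluster_recall": cl_r,
        "cluster_f": f1(cl_p, cl_r),
    }


# Toy example: five name instances, two truth authors, three predicted clusters.
truth = {"a": "T1", "b": "T1", "c": "T1", "d": "T2", "e": "T2"}
predicted = {"a": "P1", "b": "P1", "c": "P2", "d": "P2", "e": "P3"}
scores = cluster_measures(truth, predicted)
```

Every measure is derived from one pass over the frequency table, so the cost grows with the number of name instances rather than the number of instance pairs. In this sketch the K-metric components are the same two sums as B-cubed precision and recall, which is consistent with the equivalence stated above.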