Many studies of supervised author name disambiguation train their algorithms on hand-labeled ground-truth data, which are laborious to generate. This paper shows that labeled training data can be generated automatically from features available in publication records, such as email addresses, coauthor names, and cited references. First, high-precision rules for matching name instances on each feature are derived using an external authority database. Then, selected name instances in the target ambiguous data are matched pairwise according to these rules and merged into clusters by a generic entity resolution algorithm. The clustering procedure is repeated over the other features until no further merging is possible. Tested on 26,566 instances drawn from a population of 228K author name instances, this iterative clustering produced accurately labeled data with pairwise F1 = 0.99. The labeled data were representative of the population in terms of name ethnicity and the size distribution of co-disambiguating name groups. In addition, machine learning algorithms trained on the labeled data disambiguated 24K names in test data with pairwise F1 of 0.90 to 0.92. Several challenges in applying this method to author name ambiguity in large-scale scholarly data are discussed.
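The iterative clustering described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `block` key (a name-blocking key such as surname plus first initial), the thresholds in `RULES`, and the toy records are all hypothetical, and a simple union-find stands in for the generic entity resolution algorithm. Clusters pool their members' feature values, so a merge on one feature can enable a later merge on another.

```python
from itertools import combinations

# Hypothetical rules: minimum number of shared values required to match on
# each feature. The paper derives its rules from an external authority
# database; these thresholds are assumptions for illustration only.
RULES = {"email": 1, "coauthors": 2, "references": 2}


def find(parent, i):
    """Return the cluster representative of item i (with path halving)."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def iterative_clustering(instances, rules=RULES):
    """Merge name instances feature by feature until no merge is possible."""
    parent = list(range(len(instances)))
    changed = True
    while changed:
        changed = False
        for feature, min_shared in rules.items():
            # Pool the feature values of every instance in each cluster.
            pooled = {}
            for idx, inst in enumerate(instances):
                root = find(parent, idx)
                pooled.setdefault(root, set()).update(inst.get(feature, ()))
            # Compare clusters pairwise, only within the same name block.
            for a, b in combinations(sorted(pooled), 2):
                ra, rb = find(parent, a), find(parent, b)
                same_block = instances[a]["block"] == instances[b]["block"]
                if ra != rb and same_block and len(pooled[a] & pooled[b]) >= min_shared:
                    parent[rb] = ra
                    changed = True
    return [find(parent, i) for i in range(len(instances))]


def pairwise_f1(pred, truth):
    """Pairwise F1 between predicted and true cluster labels."""
    pairs = list(combinations(range(len(pred)), 2))
    pred_pairs = {p for p in pairs if pred[p[0]] == pred[p[1]]}
    true_pairs = {p for p in pairs if truth[p[0]] == truth[p[1]]}
    tp = len(pred_pairs & true_pairs)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_pairs)
    recall = tp / len(true_pairs)
    return 2 * precision * recall / (precision + recall)


# Toy records (invented): instances 0-2 are one author, instance 3 another.
instances = [
    {"block": "kim j", "email": {"jk@a.edu"}, "coauthors": {"A", "B"}, "references": {"r1"}},
    {"block": "kim j", "email": {"jk@a.edu"}, "coauthors": {"C"}, "references": {"r2", "r3"}},
    {"block": "kim j", "email": set(), "coauthors": {"A", "C"}, "references": {"r9"}},
    {"block": "kim j", "email": {"jk@b.org"}, "coauthors": {"D", "E"}, "references": {"r7", "r8"}},
]
labels = iterative_clustering(instances)
print(labels, pairwise_f1(labels, ["a", "a", "a", "b"]))
```

In this toy run, instance 2 shares only one coauthor with instance 0 and one with instance 1, so it matches neither alone; it joins them only after the email rule has merged 0 and 1 and their coauthor sets are pooled. This is the kind of merge that repeating the procedure over other features is meant to capture.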