Web-Scale Academic Name Disambiguation: the WhoIsWho Benchmark, Leaderboard, and Toolkit

Name disambiguation, a fundamental problem in online academic systems, faces growing challenges as the number of research papers increases. For example, on AMiner, an online academic search platform, about 10% of names are shared by more than 100 authors. Existing research cannot fully address such real-world hard cases, because the datasets used to build the algorithms are small-scale or low-quality. The development of effective algorithms is further hampered by the variety of tasks and evaluation protocols designed on top of diverse datasets. To this end, we present WhoIsWho, which comprises a large-scale benchmark with over 1,000,000 papers built using an interactive annotation process, a regular leaderboard with comprehensive tasks, and an easy-to-use toolkit that encapsulates the entire pipeline as well as the most powerful features and baseline models for tackling the tasks. Our strong baseline has already been deployed online in the AMiner system to handle daily arXiv paper assignments. The documentation and regular leaderboards are publicly available at http://whoiswho.biendata.xyz/.
