Whether the goal is to estimate the number of people living in a congressional district, to estimate the number of individuals who have died in an armed conflict, or to disambiguate individual authors using bibliographic data, all of these applications share a common theme: integrating information from multiple sources. Before such questions can be answered, databases must be cleaned and integrated in a systematic and accurate way, a task commonly known as record linkage, de-duplication, or entity resolution. In this article, we review motivating applications and seminal papers that have led to the growth of this area. Specifically, we review the foundational work, beginning in the 1940s and 1950s, that has led to modern probabilistic record linkage. We review clustering approaches to entity resolution, semi- and fully supervised methods, and canonicalization, which are used throughout industry and academia in applications spanning human rights, official statistics, medicine, and citation networks, among others. Finally, we discuss current research topics of practical importance.