Whether the goal is to estimate the number of people who live in a congressional district, to estimate the number of individuals who have died in an armed conflict, or to disambiguate individual authors using bibliographic data, these applications share a common theme: integrating information from multiple sources. Before such questions can be answered, databases must be cleaned and integrated in a systematic and accurate way, a task commonly known as structured entity resolution (also called record linkage or de-duplication). In this article, we review motivating applications and seminal papers that have led to the growth of this area. We review modern probabilistic and Bayesian methods from statistics, computer science, machine learning, database management, economics, political science, and other disciplines that are used throughout industry and academia in applications such as human rights, official statistics, medicine, and citation networks, among others. Finally, we discuss current research topics of practical importance.