Record linkage aims to accurately and efficiently identify records that represent the same entity within or across disparate databases. It is a fundamental task in data integration and is increasingly required for accurate decision making in application domains ranging from health analytics to national security. Traditional record linkage techniques calculate string similarities between quasi-identifying (QID) values, such as the names and addresses of people. Errors, variations, and missing QID values can, however, lead to low linkage quality because the similarities between records cannot be calculated accurately. To overcome this challenge, we propose a novel technique that can accurately link records even when QID values contain errors or variations, or are missing. We first generate attribute signatures (concatenated QID values) using an Apriori-based selection of suitable QID attributes, and then generate relational signatures that encapsulate relationship information between records. Combined, these signatures can uniquely identify individual records and facilitate fast, high-quality linking of very large databases through accurate similarity calculations between records. We evaluate the linkage quality and scalability of our approach using large real-world databases, showing that it can achieve high linkage quality even when the databases being linked contain substantial amounts of missing values and errors.
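To make the idea of attribute signatures concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes records are dictionaries, and the attribute combinations in `SIGNATURE_ATTRS`, the field names, and the helper functions `attribute_signatures` and `candidate_pairs` are all hypothetical. The Apriori-based selection of attributes and the relational signatures described above are not reproduced; the combinations are simply hard-coded.

```python
from collections import defaultdict

# Hypothetical attribute combinations. In the approach described above these
# would be chosen by an Apriori-based selection, which is NOT implemented here.
SIGNATURE_ATTRS = [
    ("first_name", "last_name", "birth_year"),
    ("last_name", "postcode"),
    ("first_name", "birth_year", "postcode"),
]


def attribute_signatures(record, attr_combos=SIGNATURE_ATTRS):
    """Return the set of signatures that can be built from one record.

    A signature is the concatenation of the (normalised) values of one
    attribute combination. A combination is skipped when any of its values
    is missing, so the record can still be matched via other combinations.
    """
    sigs = set()
    for combo in attr_combos:
        values = [record.get(attr) for attr in combo]
        if all(v not in (None, "") for v in values):
            normalised = [str(v).strip().lower() for v in values]
            # Prefix with the attribute names so that signatures built from
            # different combinations can never collide.
            sigs.add("|".join(combo) + "=" + "|".join(normalised))
    return sigs


def candidate_pairs(db_a, db_b):
    """Return pairs (id_a, id_b) of records sharing at least one signature."""
    index = defaultdict(set)
    for id_b, rec in db_b.items():
        for sig in attribute_signatures(rec):
            index[sig].add(id_b)

    pairs = set()
    for id_a, rec in db_a.items():
        for sig in attribute_signatures(rec):
            for id_b in index.get(sig, ()):
                pairs.add((id_a, id_b))
    return pairs


# Toy data (invented for illustration only).
db_a = {
    "a1": {"first_name": "Mary", "last_name": "Smith",
           "birth_year": 1980, "postcode": "2600"},
    "a2": {"first_name": "John", "last_name": "Doe",
           "birth_year": 1975, "postcode": None},
}
db_b = {
    "b1": {"first_name": "Mary", "last_name": "Smith",
           "birth_year": None, "postcode": "2600"},
    "b2": {"first_name": "Jon", "last_name": "Doe",
           "birth_year": 1975, "postcode": "2601"},
}

print(candidate_pairs(db_a, db_b))  # {('a1', 'b1')}
```

In this toy example `a1` and `b1` are paired through the `("last_name", "postcode")` signature even though `b1` has no birth year. `a2` and `b2` are not paired, because the only signature `a2` can form contains the first name, which is spelled differently in `b2`; this sketch does exact signature matching only and makes no attempt to model the relational signatures or similarity calculations mentioned in the abstract.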