Fast Bayesian Record Linkage With Record-Specific Disagreement Parameters

Researchers are often interested in linking individuals between two datasets that lack a common unique identifier. Matching procedures often struggle to match records with common names, birthplaces or other field values. Computational feasibility is also a challenge, particularly when linking large datasets. We develop a Bayesian method for automated probabilistic record linkage and show it recovers more than 50% more true matches, holding accuracy constant, than comparable methods in a matching of military recruitment data to the 1900 US Census for which expert-labelled matches are available. Our approach, which builds on a recent state-of-the-art Bayesian method, refines the modelling of comparison data, allowing disagreement probability parameters conditional on non-match status to be record-specific in the smaller of the two datasets. This flexibility significantly improves matching when many records share common field values. We show that our method is computationally feasible in practice, despite the added complexity, with an R/C++ implementation that achieves significant improvement in speed over comparable recent methods. We also suggest a lightweight method for treatment of very common names and show how to estimate true positive rate and positive predictive value when true match status is unavailable.
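
To illustrate why record-specific non-match parameters help when field values are common, the following minimal sketch computes a Fellegi-Sunter style log-likelihood ratio in which the probability of agreement given non-match is specific to each record in the smaller file. It is not the paper's estimation procedure: the empirical agreement share used in place of a Bayesian estimate, the add-one smoothing, the fixed values in `m_agree`, the toy records, and all function and variable names are assumptions made for illustration only.

```python
import math


def agreement_rates(record, candidates, fields):
    """Share of candidate records that agree with `record` on each field.

    Almost all candidate pairs are non-matches, so this share serves here as a
    crude stand-in for P(agree | non-match) specific to `record`.
    """
    n = len(candidates)
    rates = {}
    for f in fields:
        agree = sum(1 for c in candidates if c[f] == record[f])
        # Add-one smoothing keeps each rate strictly inside (0, 1).
        rates[f] = (agree + 1) / (n + 2)
    return rates


def match_weight(record, candidate, u_agree, m_agree, fields):
    """Log-likelihood ratio of match versus non-match for one record pair."""
    weight = 0.0
    for f in fields:
        if record[f] == candidate[f]:
            weight += math.log(m_agree[f] / u_agree[f])
        else:
            weight += math.log((1 - m_agree[f]) / (1 - u_agree[f]))
    return weight


fields = ["surname", "birthplace"]

# Hypothetical larger file (for example, census records).
census = [
    {"surname": "smith", "birthplace": "ohio"},
    {"surname": "smith", "birthplace": "ohio"},
    {"surname": "smith", "birthplace": "iowa"},
    {"surname": "smith", "birthplace": "ohio"},
    {"surname": "zabriskie", "birthplace": "ohio"},
    {"surname": "jones", "birthplace": "iowa"},
]

# Hypothetical records from the smaller file: one common and one rare surname.
common = {"surname": "smith", "birthplace": "ohio"}
rare = {"surname": "zabriskie", "birthplace": "ohio"}

# Assumed P(agree | match) per field, shared by all records.
m_agree = {"surname": 0.95, "birthplace": 0.90}

u_common = agreement_rates(common, census, fields)
u_rare = agreement_rates(rare, census, fields)

# Both pairs agree on every field, yet the evidence differs by record.
w_common = match_weight(common, census[0], u_common, m_agree, fields)
w_rare = match_weight(rare, census[4], u_rare, m_agree, fields)

print(f"common surname, full agreement: {w_common:.3f}")
print(f"rare surname, full agreement:   {w_rare:.3f}")
```

In this toy setup, full agreement on a rare surname yields a larger weight than full agreement on a common one, because chance agreement is less likely for the rare value. With a single set of non-match parameters shared by all records, the two pairs would receive the same weight.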
