A Bayesian Approach to Linking Data Without Unique Identifiers

Existing file linkage methods may produce suboptimal results because they consider neither the interactions between different pairs of matched records nor the relationships between variables that are exclusive to one of the files. In addition, many current methods fail to address the uncertainty in the linkage, which may result in overly precise estimates of relationships between variables that are exclusive to one of the files. Bayesian methods for record linkage can reduce the bias in the estimation of scientific relationships of interest and provide interval estimates that account for the uncertainty in the linkage; however, implementing these methods is often complex and computationally intensive. This article presents gfs_sampler, a package for the Python programming language that uses a Bayesian approach to file linkage. The linking procedure implemented in gfs_sampler samples from the joint posterior distribution of the model parameters and the linking permutations. The algorithm treats file linkage as a missing data problem and generates multiple linked data sets. For computational efficiency, only the linkage permutations are stored, and the analyses are performed separately on the data set linked by each permutation. This implementation reduces the computational complexity of the linking process and the expertise required of researchers analyzing linked data sets. We describe the algorithm implemented in the gfs_sampler package and its statistical basis, and demonstrate its use on a sample data set.
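
The idea of storing only the linking permutations and repeating the analysis once per permutation can be illustrated with a minimal sketch. It does not use the gfs_sampler API; the function names, the simulated data, and the perturbed permutations standing in for posterior draws are all hypothetical. Pooling the per-permutation results with the standard multiple-imputation combining rules is an assumption made for this illustration, motivated by the missing-data framing, and is not confirmed by the text above.

```python
import numpy as np


def analyze_one_link(x_a, y_b, perm):
    """Link file B to file A with one permutation, then fit y = a + b*x.

    Returns the slope estimate and its estimated sampling variance.
    """
    y_linked = y_b[perm]  # record i of file A is paired with record perm[i] of file B
    n = len(x_a)
    design = np.column_stack([np.ones(n), x_a])
    beta, *_ = np.linalg.lstsq(design, y_linked, rcond=None)
    resid = y_linked - design @ beta
    sigma2 = resid @ resid / (n - 2)
    cov = sigma2 * np.linalg.inv(design.T @ design)
    return beta[1], cov[1, 1]


def combine_rubin(estimates, variances):
    """Pool per-permutation results with the multiple-imputation combining rules."""
    m = len(estimates)
    q_bar = np.mean(estimates)            # pooled point estimate
    within = np.mean(variances)           # average within-link variance
    between = np.var(estimates, ddof=1)   # variance across links
    return q_bar, within + (1 + 1 / m) * between


def perturb(perm, n_swaps, rng):
    """Hypothetical stand-in for a posterior draw: swap a few links at random."""
    out = perm.copy()
    for _ in range(n_swaps):
        i, j = rng.choice(len(out), size=2, replace=False)
        out[i], out[j] = out[j], out[i]
    return out


# Simulated files: x lives only in file A, y only in file B.
rng = np.random.default_rng(0)
n = 50
x_a = rng.normal(size=n)
true_perm = rng.permutation(n)
y_b = np.empty(n)
y_b[true_perm] = 2.0 * x_a + rng.normal(scale=0.5, size=n)

# Only the permutations are stored; each one is analyzed separately.
stored_perms = [perturb(true_perm, 3, rng) for _ in range(20)]
results = [analyze_one_link(x_a, y_b, p) for p in stored_perms]
slope, total_var = combine_rubin([r[0] for r in results], [r[1] for r in results])
print(f"pooled slope: {slope:.3f}, pooled std. error: {np.sqrt(total_var):.3f}")
```

The between-permutation term is what carries the linkage uncertainty into the pooled variance; an analysis based on a single linked file would report only the within-link variance and therefore an interval that is too narrow.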
