Merging datafiles containing information on overlapping sets of entities is a challenging task in the absence of unique identifiers, and is further complicated when some entities are duplicated in the datafiles. Most approaches to this problem have focused on linking two files assumed to be free of duplicates, or on detecting which records in a single file are duplicates. However, it is common in practice to encounter scenarios that fit somewhere in between or beyond these two settings. We propose a Bayesian approach for the general setting of multifile record linkage and duplicate detection. We use a novel partition representation to propose a structured prior for partitions that can incorporate prior information about the data collection processes of the datafiles in a flexible manner, and extend previous models for comparison data to accommodate the multifile setting. We also introduce a family of loss functions to derive Bayes estimates of partitions that allow uncertain portions of the partitions to be left unresolved. The performance of our proposed methodology is explored through extensive simulations. Code implementing the methodology is available at https://github.com/aleshing/multilink.
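To make the central object concrete: the approach works with a single partition of all records, across all files, into clusters, one cluster per latent entity. A cluster containing records from different files encodes a cross-file link, while a cluster containing two or more records from the same file encodes within-file duplication; the two-file deduplicated setting and single-file duplicate detection are special cases. The following minimal Python sketch (an illustration of this representation only, not the multilink package's API; all file labels, records, and the example partition are invented) shows how one partition captures both kinds of structure at once.

```python
# Hypothetical illustration of a partition over records from multiple datafiles.
# Records are tagged (file_id, record_id); clusters correspond to latent entities.

from collections import Counter

records = [
    ("A", 1), ("A", 2), ("A", 3),   # datafile A
    ("B", 1), ("B", 2),             # datafile B
    ("C", 1), ("C", 2),             # datafile C
]

# One possible partition of the records into entities (a set partition).
partition = [
    {("A", 1), ("B", 1)},            # entity observed once in A and once in B
    {("A", 2), ("A", 3), ("C", 1)},  # entity duplicated within A, also in C
    {("B", 2)},                      # entity observed only in B
    {("C", 2)},                      # entity observed only in C
]

def summarize(partition):
    """For each cluster, report the files it draws from and any within-file duplicates."""
    for i, cluster in enumerate(partition):
        counts = Counter(file_id for file_id, _ in cluster)
        dups = {f: c for f, c in counts.items() if c > 1}
        print(f"entity {i}: files {sorted(counts)}, duplicates: {dups or 'none'}")

summarize(partition)
```

In the Bayesian approach described above, a prior is placed over such partitions and a posterior is computed from comparison data; the proposed loss functions then yield point estimates that may leave uncertain portions of the partition unresolved rather than forcing a link or non-link decision.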