Objective: To design and evaluate a general framework for interactive record linkage that combines a convenient algorithm with tractable Human Intelligent Tasks (HITs; i.e., micro tasks requiring human judgment) and that can support reproducible data science. Materials and Methods: Accurate linkage of real data requires both automatic processing of well-defined tasks and human processing of tasks that require judgment on messy data (i.e., HITs). We present a reproducible, interactive, and iterative framework for record linkage called VIEW (Visual Interactive Entity-resolution Workbench). We implemented and evaluated VIEW by integrating two commonly used hospital databases: the American Hospital Association (AHA) Annual Survey of Hospitals and the Medicare Cost Reports for Hospitals from the Centers for Medicare & Medicaid Services (CMS). Results: Using VIEW to iteratively standardize and clean the data, we linked all Texas hospitals common to both databases with 100% precision, confirming 78 approximate linkages and manually linking 28 hospitals, both through HITs. Discussion: Similarities in hospital names and addresses, together with the dynamic nature of hospital attributes, make it impossible to build a fully automated linkage system for hospitals that can be maintained over time. VIEW is software that supports a reproducible, semi-automated process that generates and tracks HITs, so that messy data elements, such as hospitals that have merged, can be reviewed and linked manually. Conclusion: Effective software that supports the interactive and iterative process of record linkage, together with well-designed HITs, can streamline linkage and support high-quality, replicable research using messy real data.
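The division of labor described above (automatic linkage for clean matches, HITs to confirm approximate matches, HITs to link the remainder by hand) can be illustrated with a minimal sketch. This is not VIEW's implementation: the `normalize`, `similarity`, and `triage` helpers, the 0.8 review threshold, and the hospital names are all hypothetical, and the standard-library `difflib.SequenceMatcher` stands in for whatever matching algorithm the framework actually uses.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace (assumed cleaning rule)."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def similarity(a: str, b: str) -> float:
    """Similarity of two names after normalization, in [0, 1]."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def triage(records_a, records_b, review_threshold=0.8):
    """Split records_a into automatic links, confirmation HITs, and manual-link HITs.

    review_threshold is a hypothetical cutoff chosen for illustration only.
    """
    linked, confirm_hits, manual_hits = [], [], []
    for rec in records_a:
        # Best candidate in the other database by name similarity.
        best = max(records_b, key=lambda cand: similarity(rec, cand))
        score = similarity(rec, best)
        if normalize(rec) == normalize(best):
            linked.append((rec, best))            # exact after cleaning: link automatically
        elif score >= review_threshold:
            confirm_hits.append((rec, best, score))  # approximate: a human confirms
        else:
            manual_hits.append(rec)               # no good candidate: a human links by hand
    return linked, confirm_hits, manual_hits


# Made-up names standing in for records from two hospital databases.
names_a = ["St. Mary's Hospital", "Alpha General Hosp", "Lakeside Clinic"]
names_b = ["ST MARYS HOSPITAL", "Alpha General Hospital", "Riverside Medical Center"]
linked, confirm_hits, manual_hits = triage(names_a, names_b)
```

In this sketch, iterating on `normalize` (the standardize-and-clean step) moves records out of the two HIT queues and into the automatic bucket, which is the sense in which cleaning and linkage feed each other.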