Sharing electronic health records (EHRs) on a large scale may lead to privacy intrusions. Recent research has shown that risks may be mitigated by simulating EHRs through generative adversarial network (GAN) frameworks. Yet the methods developed to date are limited because they 1) focus on generating data of a single type (e.g., diagnosis codes), neglecting other data types (e.g., demographics, procedures, or vital signs) and 2) do not represent constraints between features. In this paper, we introduce a method to simulate EHRs composed of multiple data types by 1) refining the GAN model, 2) accounting for feature constraints, and 3) incorporating key utility measures for such generation tasks. Our analysis with over 770,000 EHRs from Vanderbilt University Medical Center demonstrates that the new model achieves higher performance in terms of retaining basic statistics, cross-feature correlations, latent structural properties, feature constraints, and associated patterns from real data, without sacrificing privacy.