The widespread adoption of electronic health records (EHRs) and the subsequent increase in the availability of longitudinal healthcare data have led to significant advances in our understanding of health and disease, with direct and immediate impact on the development of new diagnostics and therapeutic treatment options. However, access to EHRs is often restricted because of their perceived sensitive nature and associated legal concerns, and the cohorts they contain are typically those seen at a specific hospital or network of hospitals and are therefore not representative of the wider patient population. Here, we present HealthGen, a new approach for the conditional generation of synthetic EHRs that maintains an accurate representation of real patient characteristics, temporal information and missingness patterns. We demonstrate experimentally that HealthGen generates synthetic cohorts that are significantly more faithful to real patient EHRs than those produced by the current state of the art, and that augmenting real data sets with conditionally generated cohorts of underrepresented patient subpopulations can significantly enhance the generalisability of models derived from these data sets to different patient populations. Synthetic, conditionally generated EHRs could help increase the accessibility of longitudinal healthcare data sets and improve the generalisability of inferences made from these data sets to underrepresented populations.