Due to patient privacy concerns, machine learning research in healthcare has been slower and more limited than in other application domains. High-quality, realistic synthetic electronic health records (EHRs) can accelerate methodological development for research purposes while mitigating the privacy concerns associated with data sharing. The current state-of-the-art models for synthetic EHR generation are generative adversarial networks, which are notoriously difficult to train and can suffer from mode collapse. Denoising Diffusion Probabilistic Models, a class of generative models inspired by statistical thermodynamics, have recently been shown to generate high-quality synthetic samples in certain domains. It is unknown whether they generalize to the generation of large-scale, high-dimensional EHRs. In this paper, we present a novel generative model based on diffusion models, which is the first successful application of such models to electronic health records. Our model includes a mechanism for class-conditional sampling that preserves label information. We also introduce a new sampling strategy to accelerate inference. We empirically show that our model outperforms existing state-of-the-art synthetic EHR generation methods.