In this paper, we propose a distributed Generative Adversarial Network (discGAN) to generate synthetic tabular data specific to the healthcare domain. While the use of GANs to generate images has been well studied, little to no attention has been given to the generation of tabular data. Modeling the distributions of discrete and continuous tabular data is a non-trivial task with high utility. We applied discGAN to model non-Gaussian, multi-modal healthcare data and generated 249,000 synthetic records from an original eICU dataset of 2,027 records. We evaluated the performance of the model using machine learning efficacy, the Kolmogorov-Smirnov (KS) test for continuous variables, and the chi-squared test for discrete variables. Our results show that discGAN was able to generate data with distributions similar to those of the real data.
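As an illustration of the two distributional checks named above, the following minimal sketch computes a two-sample KS statistic for one continuous column and a Pearson chi-squared statistic for one discrete column. It is not the evaluation code used in this work: the function names `ks_statistic` and `chi2_statistic` and the toy values are hypothetical, and only the test statistics are computed (no p-values). In practice, a library routine such as `scipy.stats.ks_2samp` or `scipy.stats.chi2_contingency` would also report the p-value.

```python
from bisect import bisect_right
from collections import Counter


def ks_statistic(real, synthetic):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    n, m = len(real_sorted), len(synth_sorted)
    d = 0.0
    # Both empirical CDFs are step functions, so the largest gap
    # is attained at one of the observed values.
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / n
        cdf_synth = bisect_right(synth_sorted, x) / m
        d = max(d, abs(cdf_real - cdf_synth))
    return d


def chi2_statistic(real, synthetic):
    """Pearson chi-squared statistic for a 2 x K table of category counts.

    Rows are (real, synthetic); columns are the observed categories.
    Returns the statistic and its degrees of freedom, K - 1.
    """
    real_counts = Counter(real)
    synth_counts = Counter(synthetic)
    categories = sorted(set(real_counts) | set(synth_counts))
    n_real, n_synth = len(real), len(synthetic)
    total = n_real + n_synth
    stat = 0.0
    for c in categories:
        col_total = real_counts[c] + synth_counts[c]
        for observed, row_total in (
            (real_counts[c], n_real),
            (synth_counts[c], n_synth),
        ):
            expected = row_total * col_total / total
            stat += (observed - expected) ** 2 / expected
    return stat, len(categories) - 1


if __name__ == "__main__":
    # Toy values for illustration only; not taken from the eICU dataset.
    real_ages = [34, 51, 62, 47, 70, 58]
    synth_ages = [36, 49, 64, 45, 72, 55, 60, 41]
    print("KS statistic:", ks_statistic(real_ages, synth_ages))

    real_sex = ["F", "M", "M", "F", "M", "F"]
    synth_sex = ["F", "M", "F", "F", "M", "M", "M", "F"]
    print("chi-squared, dof:", chi2_statistic(real_sex, synth_sex))
```

A small KS statistic (close to 0) and a small chi-squared statistic relative to its degrees of freedom both indicate that the synthetic column is distributed similarly to the real one.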