Clinical data usually cannot be freely distributed because of their highly confidential nature, which hampers the development of machine learning in the healthcare domain. One way to mitigate this problem is to generate realistic synthetic datasets using generative adversarial networks (GANs). However, GANs are known to suffer from mode collapse, which produces outputs of low diversity. This lowers the quality of synthetic healthcare data and may cause it to omit patients from minority demographics or neglect less common clinical practices. In this paper, we extend the classic GAN setup with an additional variational autoencoder (VAE) and include an external memory that replays latent features observed from the real samples to the GAN generator. Using antiretroviral therapy for human immunodeficiency virus (ART for HIV) as a case study, we show that our extended setup overcomes mode collapse and generates a synthetic dataset that accurately describes the severely imbalanced class distributions commonly found in real-world clinical variables. In addition, we demonstrate that our synthetic dataset is associated with a very low patient disclosure risk and that it retains a high level of utility from the ground truth dataset to support the development of downstream machine learning algorithms.