Improving Correlation Capture in Generating Imbalanced Data using Differentially Private Conditional GANs

Despite the remarkable success of Generative Adversarial Networks (GANs) on text, images, and videos, generating high-quality tabular data is still under development owing to some unique challenges, such as capturing dependencies in imbalanced data and optimizing the quality of synthetic patient data while preserving privacy. In this paper, we propose DP-CGANS, a differentially private conditional GAN framework consisting of data transformation, sampling, conditioning, and network training to generate realistic and privacy-preserving tabular data. DP-CGANS distinguishes categorical and continuous variables and transforms them to latent space separately. Then, we structure a conditional vector as an additional input that not only represents the minority class in the imbalanced data, but also captures the dependency between variables. We inject statistical noise into the gradients during the network training process of DP-CGANS to provide a differential privacy guarantee. We extensively evaluate our model against state-of-the-art generative models on three public datasets and two real-world personal health datasets in terms of statistical similarity, machine learning performance, and privacy measurement. We demonstrate that our model outperforms other comparable models, especially in capturing dependency between variables. Finally, we present the balance between data utility and privacy in synthetic data generation, considering the different data structures and characteristics of real-world datasets, such as imbalanced variables, abnormal distributions, and sparsity of data.
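To illustrate the gradient-noise step, the following is a minimal sketch of how a batch of per-example gradients could be privatized before a network update. The abstract states only that statistical noise is injected into the gradients; the per-example L2 clipping, the Gaussian noise distribution, and the names `clip_gradient`, `noisy_mean_gradient`, `max_norm`, and `noise_multiplier` are assumptions borrowed from the common DP-SGD recipe, not details confirmed for DP-CGANS.

```python
import math
import random
from typing import List, Sequence


def clip_gradient(grad: Sequence[float], max_norm: float) -> List[float]:
    """Rescale one per-example gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def noisy_mean_gradient(
    per_example_grads: Sequence[Sequence[float]],
    max_norm: float,
    noise_multiplier: float,
    rng: random.Random,
) -> List[float]:
    """Clip each gradient, sum them, add Gaussian noise, then average.

    Assumption: noise standard deviation = noise_multiplier * max_norm,
    as in the usual DP-SGD formulation (hypothetical for DP-CGANS).
    """
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    sigma = noise_multiplier * max_norm
    # Sum the clipped gradients coordinate by coordinate.
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # Add independent Gaussian noise to every coordinate of the sum.
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(clipped) for v in noisy]


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.3, 0.4]]
    print(noisy_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng))
```

Clipping bounds the influence any single record can have on the update, which is what allows the added noise to be calibrated to a fixed sensitivity. A larger `noise_multiplier` strengthens the privacy guarantee at the cost of noisier updates, which is one concrete form of the utility and privacy trade-off mentioned in the abstract.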
