Synthetic data generation has become a prevalent answer to privacy leakage and data scarcity. Generative models aim to produce a realistic synthetic dataset that faithfully captures the distribution of the real dataset. Generative adversarial networks (GANs), which have achieved great success in computer vision, are a natural choice for synthetic data generation. Although prior work has made considerable progress, most methods learn the correlations in the data distribution rather than the true process by which the data are naturally generated. Correlation is unreliable because it is a statistical measure that captures only linear dependencies and is easily affected by bias in the dataset. Causality, which encodes all underlying factors that determine how the real data are naturally generated, is more reliable than correlation. In this work, we propose a causal model named Causal Tabular Generative Neural Network (Causal-TGAN) that generates synthetic tabular data using the causal information of the tabular data. Extensive experiments on both simulated and real datasets show that our method achieves better performance when given the true causal graph and comparable performance when using an estimated causal graph.
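To make the contrast between correlation and causal generation concrete, the following is a minimal sketch of sampling tabular rows from a toy causal graph in topological order. It is not the Causal-TGAN architecture: the graph `PARENTS`, the hand-written functions in `MECHANISMS`, and the helper names `sample_row` and `pearson` are all hypothetical and stand in for whatever mechanisms a learned generative model would use. The column `y` depends on `x` only through a square, so the two are causally linked while their linear correlation is expected to be close to zero.

```python
import random
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical toy causal graph: each column maps to the list of its parents.
PARENTS = {"x": [], "y": ["x"], "z": ["x", "y"]}

# Hypothetical causal mechanisms: value = f(parent values, exogenous noise).
MECHANISMS = {
    "x": lambda p, n: n,
    "y": lambda p, n: p["x"] ** 2 + 0.1 * n,      # nonlinear effect of x
    "z": lambda p, n: p["x"] + p["y"] + 0.1 * n,  # depends on both parents
}


def sample_row(rng):
    """Generate one row by visiting columns so that parents come first."""
    row = {}
    for node in TopologicalSorter(PARENTS).static_order():
        parent_values = {name: row[name] for name in PARENTS[node]}
        row[node] = MECHANISMS[node](parent_values, rng.gauss(0.0, 1.0))
    return row


def pearson(a, b):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5


rng = random.Random(0)  # fixed seed so the sketch is reproducible
rows = [sample_row(rng) for _ in range(20000)]
xs = [r["x"] for r in rows]
ys = [r["y"] for r in rows]
zs = [r["z"] for r in rows]

r_xy = pearson(xs, ys)  # expected near 0 although x causes y
r_xz = pearson(xs, zs)  # clearly positive: z has a linear term in x
print(f"corr(x, y) = {r_xy:.3f}, corr(x, z) = {r_xz:.3f}")
```

A generator that only matches pairwise linear correlations could treat `x` and `y` as unrelated here, whereas sampling along the causal graph reproduces the dependence by construction.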