Synthetic data generation has recently gained widespread attention as a more reliable alternative to traditional data anonymization. The methods involved were originally developed for image synthesis, so applying them to the tabular and relational datasets typical of healthcare, finance, and other industries is non-trivial. While substantial research has been devoted to generating realistic tabular datasets, the study of synthetic relational databases is still in its infancy. In this paper, we combine the variational autoencoder framework with graph neural networks to generate realistic synthetic relational databases. We then apply the resulting method to two publicly available databases in computational experiments. The results indicate that the structures of the real databases are accurately preserved in the resulting synthetic datasets, even for large datasets with advanced data types.