While data sharing is crucial for knowledge development, privacy concerns and strict regulation (e.g., the European General Data Protection Regulation (GDPR)) limit its effectiveness. Synthetic tabular data emerges as an alternative that enables data sharing while fulfilling regulatory and privacy constraints. State-of-the-art tabular data synthesizers draw their methodologies from Generative Adversarial Networks (GANs) and address the two main data types in industry, i.e., continuous and categorical. In this paper, we develop CTAB-GAN, a novel conditional table GAN architecture that can effectively model diverse data types, including a mix of continuous and categorical variables. Moreover, we address data imbalance and long-tail issues, i.e., certain variables whose values occur with drastically different frequencies. To achieve these aims, we first introduce an information loss and a classification loss into the conditional GAN. Second, we design a novel conditional vector, which efficiently encodes the mixed data types and skewed distributions of data variables. We extensively evaluate CTAB-GAN against state-of-the-art GANs that generate synthetic tables, in terms of data similarity and analysis utility. The results on five datasets show that the synthetic data of CTAB-GAN remarkably resembles the real data for all three types of variables and yields up to 17% higher accuracy for five machine learning algorithms.
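The abstract does not spell out how the conditional vector is built, so the following is only a minimal sketch of the general idea, under stated assumptions: the vector is a one-hot encoding over the concatenated categories of all categorical columns, and the category to condition on is sampled with log-frequency weights so that rare categories are seen more often during training. The function names (`build_category_index`, `log_frequency_weights`, `sample_conditional_vector`) and the toy table are hypothetical, and the sketch covers categorical columns only; it is not the CTAB-GAN encoding of mixed-type variables.

```python
import math
import random
from collections import Counter


def build_category_index(columns):
    """Map each (column, category) pair to a slot in one flat vector."""
    index = {}
    for name in sorted(columns):
        for category in sorted(set(columns[name])):
            index[(name, category)] = len(index)
    return index


def log_frequency_weights(values):
    """Sampling weights proportional to log(1 + count), which flattens skew.

    Assumption: log-frequency weighting is one common way to counter
    imbalance; the abstract does not confirm this exact choice.
    """
    counts = Counter(values)
    categories = sorted(counts)
    weights = [math.log(1 + counts[c]) for c in categories]
    return categories, weights


def sample_conditional_vector(columns, index, rng):
    """Pick a column, then one of its categories, and one-hot encode the pick."""
    name = rng.choice(sorted(columns))
    categories, weights = log_frequency_weights(columns[name])
    category = rng.choices(categories, weights=weights, k=1)[0]
    vector = [0] * len(index)
    vector[index[(name, category)]] = 1
    return name, category, vector


# Toy table (hypothetical): "fraud" is heavily imbalanced, 98 vs 2 rows.
columns = {
    "fraud": ["no"] * 98 + ["yes"] * 2,
    "channel": ["web"] * 50 + ["app"] * 50,
}
index = build_category_index(columns)
rng = random.Random(0)
name, category, vector = sample_conditional_vector(columns, index, rng)
```

In this toy table the minority category `yes` makes up 2% of the rows, while its log-frequency sampling share is log(3) / (log(3) + log(99)), roughly 19%, which illustrates how such a vector can expose a generator to long-tail values more often than uniform row sampling would.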