Generative Adversarial Networks (GANs) have achieved state-of-the-art results in tabular data synthesis, under the assumption of directly accessible training data. Vertical Federated Learning (VFL) is a paradigm that allows a machine learning model to be trained distributedly across clients, each possessing unique features pertaining to the same individuals; tabular data learning is its primary use case. However, whether tabular GANs can be learned in VFL remains unknown. The demand for secure data transfer among clients and the GAN during training and data synthesis poses an extra challenge. The conditional vector of tabular GANs is a valuable tool for controlling specific features of the generated data, but it contains sensitive information from the real data, risking privacy guarantees. In this paper, we propose GTV, a VFL framework for tabular GANs whose key components are the generator, the discriminator, and the conditional vector. GTV proposes a unique distributed training architecture that lets the generator and discriminator access training data in a privacy-preserving manner. To accommodate the conditional vector in training without privacy leakage, GTV designs a training-with-shuffling mechanism that ensures no party can reconstruct the training data from the conditional vector. We evaluate the effectiveness of GTV in terms of synthetic data quality and overall training scalability. Results show that GTV consistently generates high-fidelity synthetic tabular data of quality comparable to that generated by a centralized GAN algorithm. The difference in machine learning utility can be as low as 2.7%, even under extremely imbalanced data distributions across clients and with different numbers of clients.