Protecting data privacy is paramount in fields such as finance, banking, and healthcare. Federated Learning (FL) has attracted widespread attention because it offers decentralized, distributed training and protects privacy while producing a globally shared model. However, FL presents challenges such as communication overhead and limited resource capability. This motivated us to propose a two-stage federated learning approach aimed at privacy protection, which is a first-of-its-kind study: (i) in the first stage, a synthetic dataset is generated by a modified CTGAN, obtained by supplying two different distributions as noise to the vanilla conditional tabular generative adversarial network (CTGAN); and (ii) in the second stage, a Federated Probabilistic Neural Network (FedPNN) is developed and employed to build a globally shared classification model. We also employed synthetic dataset metrics to check the quality of the generated synthetic dataset. Further, we proposed a meta-clustering algorithm in which the cluster centers obtained from the clients are clustered again at the server to train the global model. Although PNN is a one-pass learning classifier, its complexity depends on the size of the training data. Therefore, we employed a modified evolving clustering method (ECM), another one-pass algorithm, to cluster the training data, thereby further increasing speed. Moreover, we conducted a sensitivity analysis by varying Dthr, a hyperparameter of ECM, at the server and at the client, one at a time. The effectiveness of our approach is validated on four finance and medical datasets.
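The clustering and classification pipeline described above can be illustrated with a minimal sketch. This is not the authors' implementation: the one-pass clustering below uses a simplified distance-threshold rule as a stand-in for the modified ECM, clustering is assumed to be done separately per class, and the server is assumed to use the meta-cluster centers directly as the PNN pattern layer. All function names, the kernel width `sigma`, and the toy data are hypothetical.

```python
import numpy as np

def one_pass_cluster(X, dthr):
    """Simplified one-pass threshold clustering (a stand-in for ECM, not ECM itself).
    A point joins its nearest center if it lies within dthr, and that center is
    updated as a running mean; otherwise the point starts a new cluster."""
    centers = [X[0].astype(float)]
    counts = [1]
    for x in X[1:]:
        d = np.linalg.norm(np.asarray(centers) - x, axis=1)
        j = int(np.argmin(d))
        if d[j] <= dthr:
            counts[j] += 1
            centers[j] += (x - centers[j]) / counts[j]
        else:
            centers.append(x.astype(float))
            counts.append(1)
    return np.asarray(centers)

def client_centers(X, y, dthr_client):
    """Client side: cluster each class separately (assumption) and share only centers."""
    return {int(c): one_pass_cluster(X[y == c], dthr_client) for c in np.unique(y)}

def meta_cluster(all_client_centers, dthr_server):
    """Server side: cluster the pooled client centers again, class by class."""
    classes = sorted({c for cc in all_client_centers for c in cc})
    return {
        c: one_pass_cluster(
            np.vstack([cc[c] for cc in all_client_centers if c in cc]), dthr_server
        )
        for c in classes
    }

def pnn_predict(centers_by_class, X, sigma=0.5):
    """PNN with Gaussian kernels: the pattern layer holds the cluster centers,
    the summation layer averages kernel activations per class."""
    classes = sorted(centers_by_class)
    scores = []
    for c in classes:
        C = centers_by_class[c]
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        scores.append(np.exp(-d2 / (2.0 * sigma ** 2)).mean(axis=1))
    return np.asarray(classes)[np.argmax(np.vstack(scores), axis=0)]

# Toy data: two clients, each holding two well-separated classes.
rng = np.random.default_rng(0)

def make_client(n):
    X0 = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(n, 2))
    X1 = rng.normal(loc=[3.0, 3.0], scale=0.3, size=(n, 2))
    return np.vstack([X0, X1]), np.array([0] * n + [1] * n)

clients = [make_client(50) for _ in range(2)]
shared = [client_centers(X, y, dthr_client=1.0) for X, y in clients]
global_centers = meta_cluster(shared, dthr_server=1.0)
pred = pnn_predict(global_centers, np.array([[0.1, -0.1], [2.9, 3.2]]))
```

In this sketch only cluster centers leave the clients, and the size of the PNN pattern layer is governed by the two thresholds, which is why varying `dthr_client` and `dthr_server` one at a time is a natural sensitivity analysis.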