Non-independent and identically distributed (non-IID) data is a key challenge in federated learning (FL), as it typically hampers optimization convergence and degrades FL performance. Existing data augmentation methods for the non-IID problem, based on federated generative models or raw-data-sharing strategies, still suffer from low performance, privacy concerns, and high communication overhead on decentralized tabular data. To tackle these challenges, we propose a federated tabular data augmentation method named Fed-TDA. The core idea of Fed-TDA is to synthesize tabular data for augmentation using only simple statistics (e.g., the distribution of each column and the global covariance). Specifically, we propose a multimodal distribution transformation and an inverse cumulative distribution mapping to synthesize the continuous and discrete columns of tabular data, respectively, from noise according to the pre-learned statistics. Furthermore, we theoretically show that Fed-TDA not only preserves data privacy but also maintains the distribution of the original data and the correlations between columns. Through extensive experiments on five real-world tabular datasets, we demonstrate the superiority of Fed-TDA over the state of the art in test performance and communication efficiency.
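To illustrate the flavor of the discrete-column synthesis described above, the following is a minimal sketch (not the authors' exact algorithm) of inverse cumulative distribution mapping: uniform noise in [0, 1) is mapped to the first category whose cumulative frequency exceeds it, so the synthetic column reproduces the pre-learned category distribution. The function name and signature here are hypothetical.

```python
import random
from bisect import bisect_right

def inverse_cdf_sample(categories, freqs, n, seed=0):
    """Synthesize n values for a discrete column from its global
    category frequencies via inverse cumulative distribution mapping.

    categories: list of category labels
    freqs: their relative frequencies (assumed to sum to 1)
    """
    rng = random.Random(seed)
    # Build the cumulative distribution, e.g. [0.5, 0.8, 1.0]
    cum, total = [], 0.0
    for f in freqs:
        total += f
        cum.append(total)
    # bisect_right finds the first cumulative bucket containing u,
    # so P(category i is drawn) = freqs[i]
    return [categories[bisect_right(cum, rng.random())] for _ in range(n)]

synthetic = inverse_cdf_sample(["A", "B", "C"], [0.5, 0.3, 0.2], 1000)
```

Because only aggregate frequencies (not raw rows) are needed, such a mapping can be computed from statistics shared across FL clients without exchanging the underlying data.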