As machine learning and data science have matured, companies and research institutes routinely share data to overcome data scarcity. However, sharing original datasets that contain private information can leak that information. A reliable alternative is to share private synthetic datasets that preserve the statistical properties of the original datasets. In this paper, we propose MC-GEN, a privacy-preserving synthetic data generation method with a differential privacy guarantee for machine learning classification tasks. MC-GEN applies multi-level clustering and a differentially private generative model to improve the utility of the synthetic data. In the experimental evaluation, we study the effects of the parameters and the effectiveness of MC-GEN. The results show that MC-GEN is effective on multiple classification tasks under certain privacy guarantees. Moreover, we compare MC-GEN with three existing methods; the results show that MC-GEN outperforms them in terms of utility.
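To make the idea of clustering followed by a noisy generative model more concrete, the following is a minimal sketch and not the MC-GEN algorithm itself. The abstract does not specify the clustering levels or the generative model, so everything here is an assumption: the two levels (grouping by class label, then k-means within each class), the per-cluster Gaussian model, and the names `kmeans`, `noisy_gaussian_sample`, `synthesize`, and `noise_scale` are all hypothetical. In particular, `noise_scale` is an illustrative knob and is not calibrated to any formal differential privacy budget; a real mechanism would need bounded data and sensitivity-calibrated noise.

```python
import numpy as np


def kmeans(X, k, rng, iters=20):
    """Plain Lloyd's k-means; returns a cluster index for each row of X."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels


def noisy_gaussian_sample(X, n, noise_scale, rng):
    """Fit a Gaussian to X, perturb its parameters, and draw n samples.

    ASSUMPTION: the noise is illustrative only and is NOT calibrated
    to a formal (epsilon, delta) differential privacy guarantee.
    """
    d = X.shape[1]
    mean = X.mean(axis=0) + rng.normal(0.0, noise_scale, size=d)
    cov = np.atleast_2d(np.cov(X, rowvar=False)) if len(X) > 1 else np.zeros((d, d))
    noise = rng.normal(0.0, noise_scale, size=(d, d))
    cov = cov + (noise + noise.T) / 2.0  # keep the perturbed matrix symmetric
    # Clip eigenvalues so the perturbed covariance stays positive definite.
    w, v = np.linalg.eigh(cov)
    cov = (v * np.clip(w, 1e-6, None)) @ v.T
    return rng.multivariate_normal(mean, cov, size=n)


def synthesize(X, y, k=2, noise_scale=0.1, seed=0):
    """Hypothetical two-level pipeline: split by class, then cluster within class."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    X_syn, y_syn = [], []
    for label in np.unique(y):            # level 1: one group per class label
        X_c = X[y == label]
        clusters = kmeans(X_c, min(k, len(X_c)), rng)
        for j in np.unique(clusters):     # level 2: k-means clusters inside the class
            X_g = X_c[clusters == j]
            X_syn.append(noisy_gaussian_sample(X_g, len(X_g), noise_scale, rng))
            y_syn.append(np.full(len(X_g), label))
    return np.vstack(X_syn), np.concatenate(y_syn)
```

Calling `synthesize(X, y)` on a feature matrix and label vector returns a synthetic dataset of the same shape, with the same number of rows per class. A classifier trained on that output and tested on held-out real data gives one simple way to gauge utility.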