Machine learning is now one of the most common technologies for turning raw data into useful information in scientific and industrial processes. The performance of a machine learning model often depends on the size of the dataset. Companies and research institutes therefore often share or exchange their data to avoid data scarcity. However, sharing original datasets that contain private information can cause privacy leakage. Substituting synthetic datasets with similar characteristics is one way to avoid this privacy issue. Differential privacy provides a strong privacy guarantee that protects individual data records containing sensitive information. We propose MC-GEN, a privacy-preserving synthetic data generation method under a differential privacy guarantee for multiple classification tasks. MC-GEN builds differentially private generative models on multi-level clustered data to generate synthetic datasets. Our method also reduces the noise introduced by differential privacy to improve utility. In the experimental evaluation, we evaluate the effect of MC-GEN's parameters and compare MC-GEN with three existing methods. Our results show that MC-GEN can achieve significant effectiveness under certain privacy guarantees on multiple classification tasks.
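The abstract does not spell out how the differentially private generative models are built, so the following is only a minimal sketch of the general idea, not MC-GEN itself. It assumes features are scaled to [0, 1], that class labels and class sizes are public, and it swaps in a simple stand-in technique: the Laplace mechanism applied to each class mean, followed by sampling from a Gaussian with a fixed, data-independent spread. The names `dp_class_mean`, `synthesize`, and `sigma` are hypothetical and chosen for illustration.

```python
import numpy as np


def dp_class_mean(X, epsilon, rng):
    """Laplace mechanism on the mean of records clipped to [0, 1]^d."""
    X = np.clip(X, 0.0, 1.0)
    n, d = X.shape
    # Replacing one record moves each coordinate of the mean by at most 1/n,
    # so the L1 sensitivity of the mean vector is d / n.
    scale = d / (n * epsilon)
    return X.mean(axis=0) + rng.laplace(0.0, scale, size=d)


def synthesize(X, y, epsilon, sigma=0.1, seed=0):
    """Generate one synthetic record per original record, class by class.

    Assumption: sigma is a fixed, data-independent spread, so only the
    noisy mean depends on the private data.
    """
    rng = np.random.default_rng(seed)
    X_syn, y_syn = [], []
    for label in np.unique(y):
        Xc = X[y == label]
        mu = dp_class_mean(Xc, epsilon, rng)
        # Sample around the noisy mean; mu broadcasts across the rows.
        X_syn.append(rng.normal(mu, sigma, size=Xc.shape))
        y_syn.append(np.full(len(Xc), label))
    return np.vstack(X_syn), np.concatenate(y_syn)


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    X = rng.random((100, 4))            # toy features in [0, 1]
    y = np.array([0] * 50 + [1] * 50)   # two classes
    X_syn, y_syn = synthesize(X, y, epsilon=1.0)
    print(X_syn.shape, y_syn.shape)
```

Because each class is a disjoint partition of the records, the per-class noise budgets do not add up under the stated assumptions. Smaller groups receive proportionally more noise, which is one reason the way data is partitioned matters for utility.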