Machine Learning (ML) has achieved enormous success on a variety of problems in computer vision, speech recognition, and object detection, to name a few. The principal reason for this success is the availability of huge datasets for training deep neural networks (DNNs). However, datasets cannot be publicly released if they contain sensitive information such as medical records, and data privacy becomes a major concern. Encryption methods could be a possible solution; however, their deployment in ML applications severely impacts classification accuracy and incurs substantial computational overhead. Alternatively, obfuscation techniques could be used, but maintaining a good trade-off between visual privacy and accuracy is challenging. In this paper, we propose a method to generate secure synthetic datasets from the original private datasets. Given a network with Batch Normalization (BN) layers pretrained on the original dataset, we first record the class-wise BN layer statistics. Next, we generate the synthetic dataset by optimizing random noise such that the synthetic data match the layer-wise statistical distribution of the original images. We evaluate our method on image classification datasets (CIFAR10, ImageNet) and show that synthetic data can be used in place of the original CIFAR10/ImageNet data for training networks from scratch, producing comparable classification performance. Further, to analyze the visual privacy provided by our method, we use Image Quality Metrics and show a high degree of visual dissimilarity between the original and synthetic images. Moreover, we show that our proposed method preserves data privacy under various privacy-leakage attacks, including the Gradient Matching Attack, Model Memorization Attack, and GAN-based Attack.
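To make the statistics-matching step concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: it hooks the BN layers of a hypothetical pretrained resnet18, measures the mean and variance of the features produced by a batch of noise, and optimizes the noise so those statistics match the layers' stored running statistics (used here as a stand-in for the class-wise statistics the paper records). The network, target_class, and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch of BN-statistics matching for synthetic data generation.
# Assumptions (not from the paper): a torchvision resnet18 standing in for the
# pretrained network, global BN running statistics instead of class-wise ones,
# and a simple Adam loop with illustrative hyperparameters.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(num_classes=10).eval()        # hypothetical pretrained network
for p in model.parameters():
    p.requires_grad_(False)

# Forward hooks measure the statistics of the features entering each BN layer
# and compare them to that layer's stored running statistics.
stat_losses = []
def bn_hook(module, inputs, output):
    x = inputs[0]
    mean = x.mean(dim=[0, 2, 3])
    var = x.var(dim=[0, 2, 3], unbiased=False)
    stat_losses.append(F.mse_loss(mean, module.running_mean) +
                       F.mse_loss(var, module.running_var))

for m in model.modules():
    if isinstance(m, nn.BatchNorm2d):
        m.register_forward_hook(bn_hook)

# Optimize random noise so its layer-wise statistics match the recorded ones,
# while a cross-entropy term pushes the batch toward a chosen class.
target_class = 3                               # hypothetical class label
x = torch.randn(32, 3, 32, 32, requires_grad=True)
labels = torch.full((32,), target_class, dtype=torch.long)
opt = torch.optim.Adam([x], lr=0.05)
for step in range(2000):
    stat_losses.clear()
    opt.zero_grad()
    logits = model(x)
    loss = sum(stat_losses) + F.cross_entropy(logits, labels)
    loss.backward()
    opt.step()
```

After optimization, the batch x serves as synthetic training data; in the paper's setting this procedure would be repeated per class using the recorded class-wise statistics, and the resulting dataset used to train a network from scratch.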