Machine Learning (ML) has achieved enormous success in solving a variety of problems in computer vision, speech recognition, and object detection, to name a few. The principal reason for this success is the availability of huge datasets for training deep neural networks (DNNs). However, datasets cannot be publicly released if they contain sensitive information such as medical or financial records. In such cases, data privacy becomes a major concern. Encryption methods offer a possible solution to this issue; however, their deployment in ML applications is non-trivial, as they seriously impact classification accuracy and incur substantial computational overhead. Alternatively, obfuscation techniques can be used, but maintaining a good balance between visual privacy and accuracy is challenging. In this work, we propose a method to generate secure synthetic datasets from the original private datasets. In our method, given a network with Batch Normalization (BN) layers pre-trained on the original dataset, we first record the layer-wise BN statistics. Next, using the BN statistics and the pre-trained model, we generate the synthetic dataset by optimizing random noise such that the synthetic data match the layer-wise statistical distribution recorded from the original model. We evaluate our method on an image classification dataset (CIFAR10) and show that our synthetic data can be used for training networks from scratch, producing reasonable classification performance.
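The two steps described above (record BN statistics, then optimize random noise to match them) can be illustrated with a deliberately tiny sketch. It is not the implementation evaluated in this work: a single fixed linear map `W` stands in for the pre-trained network, the loss is an assumed squared distance between batch statistics and recorded statistics, and the gradient is derived by hand so that only the Python standard library is needed. All names (`W`, `forward`, `batch_stats`, `stat_loss`, `synthesize`) are hypothetical.

```python
import random

# Hypothetical stand-in for a pre-trained layer: a fixed 2x2 linear map
# whose outputs would feed a BN layer. A real setup would use a full DNN.
W = [[1.0, -0.5],
     [0.5, 1.0]]

def forward(x):
    """Pre-BN activations z = x @ W for a batch x given as a list of rows."""
    return [[sum(row[k] * W[k][j] for k in range(len(W)))
             for j in range(len(W[0]))] for row in x]

def batch_stats(z):
    """Per-feature batch mean and biased variance, as a BN layer computes them."""
    n, d = len(z), len(z[0])
    mean = [sum(r[j] for r in z) / n for j in range(d)]
    var = [sum((r[j] - mean[j]) ** 2 for r in z) / n for j in range(d)]
    return mean, var

def stat_loss(z, mu, s2):
    """Assumed objective: squared gap between batch and recorded statistics."""
    mean, var = batch_stats(z)
    return (sum((m - a) ** 2 for m, a in zip(mean, mu))
            + sum((v - b) ** 2 for v, b in zip(var, s2)))

def synthesize(mu, s2, n=32, steps=500, lr=0.1, seed=0):
    """Optimize a batch of random noise so its statistics match (mu, s2)."""
    rng = random.Random(seed)
    x = [[rng.gauss(0.0, 1.0) for _ in W] for _ in range(n)]  # random noise
    initial = stat_loss(forward(x), mu, s2)
    for _ in range(steps):
        z = forward(x)
        mean, var = batch_stats(z)
        d = len(mean)
        # Hand-derived gradient of stat_loss with respect to each z[i][j].
        gz = [[2 * (mean[j] - mu[j]) / n
               + 4 * (var[j] - s2[j]) * (r[j] - mean[j]) / n
               for j in range(d)] for r in z]
        # Chain rule through the linear map: dL/dx[i][k] = sum_j gz[i][j] * W[k][j].
        for i in range(n):
            for k in range(len(W)):
                x[i][k] -= lr * sum(gz[i][j] * W[k][j] for j in range(d))
    return x, initial, stat_loss(forward(x), mu, s2)

if __name__ == "__main__":
    # Step 1: record statistics from (toy) private data; only these are kept.
    rng = random.Random(1)
    private = [[rng.gauss(1.0, 2.0) for _ in W] for _ in range(32)]
    mu, s2 = batch_stats(forward(private))
    # Step 2: generate synthetic data from noise using the statistics alone.
    _, before, after = synthesize(mu, s2)
    print(f"statistic-matching loss: {before:.4f} -> {after:.6f}")
```

The synthesis step reads only `mu` and `s2`, never the private samples themselves, which is the property the method relies on. In a realistic setting the statistics would come from every BN layer of the pre-trained network, and the gradient would be obtained by automatic differentiation in a deep learning framework rather than by hand.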