Storage-efficient privacy-preserving learning is crucial due to the increasing amounts of sensitive user data required for modern learning tasks. We propose a framework for reducing the storage cost of user data while at the same time providing privacy guarantees, without essential loss in the utility of the data for learning. Our method comprises noise injection followed by lossy compression. We show that, when the lossy compression is appropriately matched to the distribution of the added noise, the compressed examples converge in distribution to the noise-free training data as the sample size of the training data (or the dimension of the training data) increases. In this sense, the utility of the data for learning is essentially maintained, while storage and privacy leakage are reduced by quantifiable amounts. We present experimental results on the CelebA dataset for gender classification and find that our suggested pipeline delivers in practice on the promise of the theory: the individuals in the images are unrecognizable (or less recognizable, depending on the noise level), the overall storage of the data is substantially reduced, and there is no essential loss (and in some cases a slight boost) in classification accuracy. As an added bonus, our experiments suggest that our method yields a substantial boost to robustness in the face of adversarial test data.
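The noise-then-compress pipeline described above can be illustrated with a minimal sketch: inject noise into each training example, then quantize the noisy values with a step size matched to the noise scale, so that the coarse quantized representation (which compresses well) carries roughly the same information as the noisy data itself. The function name `privatize_and_compress`, the Gaussian noise choice, and the `step = sigma` matching rule are illustrative assumptions for this sketch, not the paper's actual construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def privatize_and_compress(x, sigma=0.1, step=None):
    """Add noise for privacy, then quantize for storage.

    Illustrative sketch: Gaussian noise plus a uniform scalar
    quantizer whose step is matched to the noise scale (an assumed
    matching rule, not the paper's), so quantization error stays
    statistically masked by the injected noise.
    """
    if step is None:
        step = sigma  # assumed matching of quantizer step to noise level
    noisy = x + rng.normal(0.0, sigma, size=x.shape)
    # Snap to a coarse grid: far fewer distinct values, so the
    # result is much cheaper to store after entropy coding.
    return step * np.round(noisy / step)

# Usage: a small "image" of pixel intensities in [0, 1).
x = rng.random((4, 4)).astype(np.float32)
y = privatize_and_compress(x, sigma=0.1)
```

Every entry of `y` lies on the grid of multiples of `step`, which is what makes the output compressible; the privacy and utility guarantees in the paper come from how the quantizer is matched to the noise distribution, which this toy grid-snapping only gestures at.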