Storage-efficient privacy-preserving learning is crucial due to the increasing amounts of sensitive user data required for modern learning tasks. We propose a framework for reducing the storage cost of user data while simultaneously providing privacy guarantees, without essential loss in the utility of the data for learning. Our method comprises noise injection followed by lossy compression. We show that, when the lossy compression is appropriately matched to the distribution of the added noise, the compressed examples converge, in distribution, to that of the noise-free training data as the sample size of the training data (or the dimension of the training data) increases. In this sense, the utility of the data for learning is essentially maintained, while storage and privacy leakage are reduced by quantifiable amounts. We present experimental results on the CelebA dataset for gender classification and find that our suggested pipeline delivers in practice on the promise of the theory: the individuals in the images are unrecognizable (or less recognizable, depending on the noise level), the overall storage of the data is substantially reduced, and the classification accuracy suffers no essential loss (and in some cases even improves slightly). As an added bonus, our experiments suggest that our method yields a substantial boost to robustness in the face of adversarial test data.
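A minimal sketch of the noise-then-compress pipeline described above, under stated assumptions: we use Gaussian noise injection followed by uniform scalar quantization as a stand-in for the lossy compressor, with the quantization step tied to the noise standard deviation. The function name, the Gaussian noise choice, and the specific step-to-noise matching rule are all illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def privatize_and_compress(x, noise_std=0.05, rng=None):
    """Illustrative sketch (not the paper's exact method):
    1) inject i.i.d. Gaussian noise for privacy,
    2) lossily compress via uniform scalar quantization whose step
       is matched (here, proportional) to the noise level.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Step 1: noise injection (Gaussian is an assumed choice).
    noisy = x + rng.normal(0.0, noise_std, size=x.shape)
    # Step 2: quantization step matched to the noise std (assumed rule);
    # coarser quantization -> smaller storage footprint per example.
    step = 2.0 * noise_std
    return np.round(noisy / step) * step

# Usage: quantized outputs lie on a coarse grid, so they can be
# entropy-coded with far fewer bits than the raw values.
x = np.linspace(0.0, 1.0, 8)
y = privatize_and_compress(x, noise_std=0.05)
```

In practice the quantized values would then be entropy-coded; the storage saving comes from the reduced number of distinct levels, while the matching of the step to the noise level is what preserves the data distribution in the limit.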