Lossy Compression of Noisy Data for Private and Data-Efficient Learning

Source: arXiv. Published in the IEEE Journal on Selected Areas in Information Theory (JSAIT). A preliminary version was presented at the IEEE International Symposium on Information Theory (ISIT), 2022, under a slightly different title, "Learning under Storage and Privacy Constraints."

Storage-efficient privacy-preserving learning is crucial due to increasing amounts of sensitive user data required for modern learning tasks. We propose a framework for reducing the storage cost of user data while at the same time providing privacy guarantees, without essential loss in the utility of the data for learning. Our method comprises noise injection followed by lossy compression. We show that, when appropriately matching the lossy compression to the distribution of the added noise, the compressed examples converge, in distribution, to that of the noise-free training data as the sample size of the training data (or the dimension of the training data) increases. In this sense, the utility of the data for learning is essentially maintained, while reducing storage and privacy leakage by quantifiable amounts. We present experimental results on the CelebA dataset for gender classification and find that our suggested pipeline delivers in practice on the promise of the theory: the individuals in the images are unrecognizable (or less recognizable, depending on the noise level), overall storage of the data is substantially reduced, with no essential loss (and in some cases a slight boost) to the classification accuracy. As an added bonus, our experiments suggest that our method yields a substantial boost to robustness in the face of adversarial test data.
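The two-stage pipeline in the abstract (inject noise, then lossy-compress at a distortion level matched to that noise) can be illustrated with a toy sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the noise is i.i.d. Gaussian with standard deviation `sigma`, the compressor is a plain uniform scalar quantizer, and "matching" is read as choosing the step so that the quantizer's mean squared error, roughly `step**2 / 12`, equals `sigma**2`. The helper names (`add_gaussian_noise`, `matched_step`, and so on) are hypothetical. A scalar quantizer like this does not reproduce the convergence-in-distribution result the abstract states; it only shows the shape of the pipeline and where the storage saving comes from.

```python
import math
import random


def add_gaussian_noise(x, sigma, rng):
    """Privacy step: perturb each coordinate with i.i.d. N(0, sigma^2) noise."""
    return [v + rng.gauss(0.0, sigma) for v in x]


def matched_step(sigma):
    """Assumed matching rule: a uniform quantizer with step s has MSE of
    about s^2 / 12, so s = sigma * sqrt(12) gives distortion near sigma^2."""
    return sigma * math.sqrt(12.0)


def quantize(y, step):
    """Lossy compression step: keep only the index of the nearest multiple of step."""
    return [round(v / step) for v in y]


def dequantize(indices, step):
    """Reconstruct approximate values from the stored integer indices."""
    return [k * step for k in indices]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


rng = random.Random(0)
sigma = 0.1

# Stand-in for clean training data: 20,000 scalar "pixels" in [0, 1).
clean = [rng.random() for _ in range(20000)]

noisy = add_gaussian_noise(clean, sigma, rng)   # stage 1: noise injection
step = matched_step(sigma)
stored = quantize(noisy, step)                  # stage 2: lossy compression
reconstructed = dequantize(stored, step)

distortion = mse(noisy, reconstructed)
levels = len(set(stored))

print(f"target distortion (sigma^2): {sigma ** 2:.4f}")
print(f"measured distortion:         {distortion:.4f}")
print(f"distinct stored levels:      {levels}")
```

In this sketch only the small integer indices in `stored` need to be kept, so each coordinate costs a few bits instead of a full floating-point value. Raising `sigma` widens the quantizer step, which trades more privacy noise for fewer stored levels.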
