PEOPL: Characterizing Privately Encoded Open Datasets with Public Labels

Homa Esfahanizadeh, Adam Yala, Rafael G. L. D'Oliveira, Andrea J. D. Jaba, Victor Quach, Ken R. Duffy, Tommi S. Jaakkola, Vinod Vaikuntanathan, Manya Ghobadi, Regina Barzilay, Muriel Médard

From arXiv. Submitted to IEEE Transactions on Information Forensics and Security.

Allowing organizations to share their data for training machine learning (ML) models without unintended information leakage is an open problem in practice. A promising technique is to train models on encoded data. Our approach, called Privately Encoded Open Datasets with Public Labels (PEOPL), uses a certain class of randomly constructed transforms to encode sensitive data. Organizations publish their randomly encoded data and the associated raw labels for ML training, and training is done without knowledge of the encoding realization. We investigate several important aspects of this problem. We introduce information-theoretic scores for privacy and utility, which quantify the average performance of an unfaithful user (e.g., an adversary) and a faithful user (e.g., a model developer) who have access to the published encoded data. We then theoretically characterize primitives for building families of encoding schemes that motivate the use of random deep neural networks. Empirically, we evaluate our randomized encoding scheme and a linear scheme against a suite of computational attacks, and we show that our scheme achieves prediction accuracy competitive with raw-sample baselines. Moreover, we demonstrate that multiple institutions, using independent random encoders, can collaborate to train improved ML models.
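
The publish-and-train workflow described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the encoder is a small two-layer leaky-ReLU network with Gaussian weights drawn from a private seed, the downstream model is plain logistic regression, the data are synthetic Gaussian blobs, and the names `make_random_encoder`, `train_logistic`, and `make_toy_data` are hypothetical. The only point it tries to convey is that the model developer sees encoded samples and raw labels, never the encoder weights.

```python
import numpy as np


def make_random_encoder(in_dim, hidden_dim, out_dim, seed):
    """Build a fixed, untrained two-layer network from a private seed.

    Assumption: the seed stands in for the private encoding realization.
    """
    rng = np.random.default_rng(seed)
    w1 = rng.normal(0.0, np.sqrt(2.0 / in_dim), size=(in_dim, hidden_dim))
    w2 = rng.normal(0.0, np.sqrt(2.0 / hidden_dim), size=(hidden_dim, out_dim))

    def encode(x):
        h = x @ w1
        h = np.where(h > 0.0, h, 0.2 * h)  # leaky ReLU nonlinearity
        return h @ w2

    return encode


def train_logistic(z, y, lr=0.1, steps=500):
    """Fit logistic regression by gradient descent on encoded features only."""
    mean = z.mean(axis=0)
    std = z.std(axis=0) + 1e-8
    zs = (z - mean) / std  # standardize the published features
    w = np.zeros(zs.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(zs @ w + b)))
        w -= lr * (zs.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)

    def predict(z_new):
        scores = ((z_new - mean) / std) @ w + b
        return (scores > 0.0).astype(int)

    return predict


def make_toy_data(n, dim, seed):
    """Synthetic two-class data: unit-variance Gaussians centered at -1 and +1."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    x = rng.normal(size=(n, dim)) + (2.0 * y[:, None] - 1.0)
    return x, y


# Data owner: encodes sensitive samples with a private seed, publishes (z, y).
x, y = make_toy_data(400, 10, seed=0)
encode = make_random_encoder(10, 64, 32, seed=12345)
z = encode(x)

# Model developer: trains and evaluates using only the published encoded data.
predict = train_logistic(z[:300], y[:300])
accuracy = float(np.mean(predict(z[300:]) == y[300:]))
print(f"accuracy on encoded held-out samples: {accuracy:.3f}")
```

A second institution could call `make_random_encoder` with its own independent seed and publish its own encoded samples. How such independently encoded datasets are combined for joint training is not specified in the abstract, so it is left out of this sketch.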
