A Unified Framework for Quantifying Privacy Risk in Synthetic Data

Synthetic data is often presented as a method for sharing sensitive information in a privacy-preserving manner: it reproduces the global statistical properties of the original data without disclosing sensitive information about any individual. In practice, as with other anonymization methods, privacy risks cannot be entirely eliminated, so the residual risks must be assessed ex post. We present Anonymeter, a statistical framework to jointly quantify different types of privacy risks in synthetic tabular datasets. We equip this framework with attack-based evaluations for the singling out, linkability, and inference risks, the three key indicators of factual anonymization according to the European General Data Protection Regulation (GDPR). To the best of our knowledge, we are the first to introduce a coherent and legally aligned evaluation of these three privacy risks for synthetic data, and the first to design privacy attacks that directly model the singling out and linkability risks. We demonstrate the effectiveness of our methods with an extensive set of experiments that measure the privacy risks of data with deliberately inserted privacy leakages, and of synthetic data generated with and without differential privacy. Our results show that the three privacy risks reported by our framework scale linearly with the amount of privacy leakage in the data. Furthermore, we observe that synthetic data exhibits the lowest vulnerability against linkability, indicating that one-to-one relationships between real and synthetic data records are not preserved. Finally, we demonstrate quantitatively that Anonymeter outperforms existing synthetic data privacy evaluation frameworks both in detecting privacy leaks and in computation speed. To contribute to a privacy-conscious usage of synthetic data, we open source Anonymeter at https://github.com/statice/anonymeter.
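
To make the idea of an attack-based evaluation more concrete, the following is a minimal, self-contained sketch of a toy inference attack. It is not the Anonymeter implementation or its API. All names (`guess_secret`, `success_rate`, `residual_risk`), the toy records, and the normalization of the attack success rate against a control set of records that were not used to generate the synthetic data are assumptions made purely for illustration; the abstract does not specify how the risk is computed.

```python
from typing import Sequence, Tuple

Record = Tuple[int, ...]  # hypothetical toy record: categorical columns encoded as ints


def guess_secret(known: Record, synthetic: Sequence[Record], secret_idx: int) -> int:
    """Guess the secret column of a target from its closest synthetic record.

    Closeness is the Hamming distance over the columns the attacker knows.
    """
    def distance(row: Record) -> int:
        row_known = row[:secret_idx] + row[secret_idx + 1:]
        return sum(a != b for a, b in zip(row_known, known))

    return min(synthetic, key=distance)[secret_idx]


def success_rate(targets: Sequence[Record], synthetic: Sequence[Record], secret_idx: int) -> float:
    """Fraction of targets whose secret value the attacker guesses correctly."""
    hits = 0
    for row in targets:
        known = row[:secret_idx] + row[secret_idx + 1:]
        if guess_secret(known, synthetic, secret_idx) == row[secret_idx]:
            hits += 1
    return hits / len(targets)


def residual_risk(r_train: float, r_control: float) -> float:
    """Assumed normalization: excess success on training records over control records."""
    if r_control >= 1.0:
        return 0.0  # the attack cannot do better than it already does on control data
    return max(0.0, (r_train - r_control) / (1.0 - r_control))


# Toy data (assumed): the last column is the secret attribute.
training = [(1, 0, 1), (0, 1, 0), (1, 1, 1), (0, 0, 0)]   # used to "generate" the synthetic data
synthetic = [(1, 0, 1), (0, 1, 0), (1, 1, 0), (0, 0, 0)]  # leaks three of four training secrets
control = [(1, 0, 1), (0, 1, 1), (1, 1, 1), (0, 0, 0)]    # held out, never seen by the generator

r_train = success_rate(training, synthetic, secret_idx=2)
r_control = success_rate(control, synthetic, secret_idx=2)
print(r_train, r_control, residual_risk(r_train, r_control))
```

In this sketch, the attack succeeds more often on training records than on control records, and the difference is read as leakage specific to the records that were used to produce the synthetic data rather than as general predictability of the secret attribute.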
