Synthetic data has been advertised as a silver-bullet solution for privacy-preserving data publishing that addresses the shortcomings of traditional anonymisation techniques. The promise is that synthetic data drawn from generative models preserves the statistical properties of the original dataset while providing perfect protection against privacy attacks. In this work, we present the first quantitative evaluation of the privacy gain of synthetic data publishing and compare it to that of previous anonymisation techniques. Our evaluation of a wide range of state-of-the-art generative models demonstrates that synthetic data either does not prevent inference attacks or does not retain data utility. In other words, we empirically show that synthetic data does not provide a better tradeoff between privacy and utility than traditional anonymisation techniques. Furthermore, in contrast to traditional anonymisation, the privacy-utility tradeoff of synthetic data publishing is hard to predict: because it is impossible to foresee which signals a synthetic dataset will preserve and which information it will lose, synthetic data publishing yields a highly variable privacy gain and an unpredictable utility loss. In summary, we find that synthetic data is far from the holy grail of privacy-preserving data publishing.