Synthetic data has gained significant momentum thanks to sophisticated machine learning tools that enable the synthesis of high-dimensional datasets. However, many generation techniques do not give the data controller any control over which statistical patterns are captured in the generated data, leading to concerns over privacy protection. Although synthetic records are not linked to any particular real-world individual, they can still reveal information about users indirectly, which may be unacceptable to data owners. There is thus a need to empirically verify the privacy of synthetic data -- a particularly challenging task in high-dimensional settings. In this paper we present a general framework for synthetic data generation that gives data controllers full control over which statistical properties the synthetic data ought to preserve, what exact information loss is acceptable, and how to quantify it. The benefits of the approach are that (1) one can generate synthetic data with high utility for a given task, while (2) empirically validating that only statistics deemed safe by the data curator are used to generate the data. We thus show the potential for synthetic data to be an effective means of releasing confidential data safely, while retaining useful information for analysts.