Privacy-preserving data analysis is emerging as a challenging problem with far-reaching impact. In particular, synthetic data are a promising concept toward solving the aporetic conflict between data privacy and data sharing. Yet, it is known that accurately generating private, synthetic data of certain kinds is NP-hard. We develop a statistical framework for differentially private synthetic data, which enables us to circumvent the computational hardness of the problem. We consider the true data as a random sample drawn from a population Omega according to some unknown density. We then replace Omega by a much smaller random subset Omega^*, which we sample according to some known density. We generate synthetic data on the reduced space Omega^* by fitting the specified linear statistics obtained from the true data. To ensure privacy we use the common Laplace mechanism. Employing the concept of a Rényi condition number, which measures how well the sampling distribution is correlated with the population distribution, we derive explicit bounds on the privacy and accuracy provided by the proposed method.
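The pipeline described above (sample a reduced space, compute linear statistics of the true data, privatize them with Laplace noise, fit a density on the reduced space, then sample) can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the function name `private_synthetic_data`, the choice of Omega = [0, 1]^d with a uniform sampling density, the test functions bounded in [0, 1], the noise scale K / (n * epsilon), and the exponentiated-gradient least-squares fit are all assumptions made for illustration.

```python
import numpy as np

def private_synthetic_data(x_true, test_fns, m, epsilon, n_synth, rng,
                           n_iter=2000, step=0.2):
    """Illustrative sketch only; every modelling choice here is an assumption."""
    n, d = x_true.shape
    k = len(test_fns)

    # Reduced space Omega^*: m points drawn from a known density (uniform on
    # [0, 1]^d here, an assumption), independently of the true data.
    omega_star = rng.uniform(size=(m, d))

    # Linear statistics: the empirical mean of each test function on the true data.
    stats = np.array([f(x_true).mean() for f in test_fns])

    # Laplace mechanism. Assuming each test function takes values in [0, 1],
    # replacing one record moves each mean by at most 1/n, so the L1
    # sensitivity of the K statistics is at most K/n.
    noisy = stats + rng.laplace(scale=k / (n * epsilon), size=k)

    # Fit a probability vector h on Omega^* whose statistics A @ h match the
    # noisy ones, using exponentiated-gradient steps on the least-squares loss.
    # This touches only the noisy statistics, so it is post-processing.
    A = np.stack([f(omega_star) for f in test_fns])  # shape (k, m)
    h = np.full(m, 1.0 / m)
    for _ in range(n_iter):
        grad = A.T @ (A @ h - noisy)
        h = h * np.exp(-step * grad)
        h /= h.sum()

    # Synthetic data: resample points of Omega^* according to the fitted weights.
    idx = rng.choice(m, size=n_synth, p=h)
    return omega_star[idx], omega_star, h, noisy

# Hypothetical usage: 2-D data with first moments and one mixed moment as statistics.
rng = np.random.default_rng(0)
x_true = rng.beta(2.0, 5.0, size=(5000, 2))
test_fns = [
    lambda x: x[:, 0],
    lambda x: x[:, 1],
    lambda x: x[:, 0] * x[:, 1],
]
synth, omega_star, weights, noisy = private_synthetic_data(
    x_true, test_fns, m=500, epsilon=1.0, n_synth=1000, rng=rng
)
```

Because the reduced space is sampled without looking at the true data and the fit uses only the noised statistics, the privacy guarantee in this sketch comes entirely from the Laplace step. How closely the fitted weights can reproduce the statistics depends on how well the sampling density covers the population density, which is the role the abstract assigns to the Rényi condition number.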