While the generation of synthetic data under differential privacy (DP) has received a lot of attention in the data privacy community, the analysis of such synthetic data has received much less. Existing work has shown that simply analysing DP synthetic data as if it were real does not produce valid inferences of population-level quantities. For example, confidence intervals become too narrow, which we demonstrate with a simple experiment. We tackle this problem by combining synthetic data analysis techniques from the field of multiple imputation (MI) with synthetic data generation using noise-aware (NA) Bayesian modeling. The resulting pipeline, NA+MI, allows computing accurate uncertainty estimates for population-level quantities from DP synthetic data. To implement NA+MI for discrete data generation from marginal queries, we develop a novel noise-aware synthetic data generation algorithm, NAPSU-MQ, using the principle of maximum entropy. Our experiments demonstrate that the pipeline produces accurate confidence intervals from DP synthetic data. The intervals widen as the privacy bound tightens, accurately capturing the additional uncertainty stemming from DP noise.
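The MI step of such a pipeline can be illustrated with a small sketch. The abstract does not state which combining rules are used, so the code below assumes one commonly cited variant for fully synthetic data, in which the total variance is (1 + 1/m) times the between-dataset variance minus the mean within-dataset variance. The function name `combine_synthetic_estimates`, the normal approximation to the interval, and the choice to raise an error on a non-positive variance estimate are all illustrative assumptions, not the authors' implementation.

```python
from statistics import NormalDist, mean, variance


def combine_synthetic_estimates(estimates, variances, confidence=0.95):
    """Combine per-dataset results into one estimate and interval.

    estimates[i] and variances[i] are the point estimate and its estimated
    variance computed on the i-th synthetic dataset (hypothetical inputs).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two datasets with matching lengths")

    q_bar = mean(estimates)    # combined point estimate
    b = variance(estimates)    # between-dataset sample variance
    v_bar = mean(variances)    # mean within-dataset variance

    # Assumed combining rule for fully synthetic data.
    total_var = (1 + 1 / m) * b - v_bar
    if total_var <= 0:
        # This estimator can be non-positive for small m; a real analysis
        # would need a fallback, which this sketch does not attempt.
        raise ValueError("non-positive variance estimate; use more datasets")

    # Normal approximation (a simplification; a t reference distribution
    # is often used instead).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * total_var ** 0.5
    return q_bar, (q_bar - half_width, q_bar + half_width)
```

Under these assumptions, a noisier generator spreads the per-dataset estimates further apart, which raises the between-dataset variance and widens the interval. That is the qualitative behaviour the abstract describes for tighter privacy.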