While the generation of synthetic data under differential privacy (DP) has received a lot of attention in the data privacy community, the analysis of such data has received much less. Existing work has shown that simply analysing DP synthetic data as if it were real does not produce valid inferences about population-level quantities. For example, confidence intervals become too narrow, which we demonstrate with a simple experiment. We tackle this problem by combining synthetic data analysis techniques from the field of multiple imputation (MI) with synthetic data generation using noise-aware (NA) Bayesian modeling. The resulting pipeline, NA+MI, allows computing accurate uncertainty estimates for population-level quantities from DP synthetic data. To implement NA+MI for discrete data generation from marginal queries, we develop a novel noise-aware synthetic data generation algorithm, NAPSU-MQ, using the principle of maximum entropy. Our experiments demonstrate that the pipeline is able to produce accurate confidence intervals from DP synthetic data. As the privacy guarantee is tightened, the intervals become wider, accurately capturing the additional uncertainty stemming from DP noise.
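To make the multiple-imputation step concrete, the following is a minimal sketch of how estimates from several synthetic datasets might be combined into one confidence interval. It assumes the combining rules for fully synthetic data from the multiple-imputation literature (Raghunathan, Reiter and Rubin, 2003); whether NA+MI uses exactly these rules is an assumption, as the abstract does not say. The function name `combine_synthetic_estimates`, the normal approximation for the interval, and the fallback for a non-positive variance estimate are illustrative choices, not part of the described pipeline.

```python
from statistics import NormalDist, mean, variance


def combine_synthetic_estimates(point_estimates, variance_estimates, confidence=0.95):
    """Combine per-dataset estimates from m synthetic datasets.

    point_estimates[i] is the estimate of the quantity of interest computed
    on synthetic dataset i, and variance_estimates[i] is its estimated
    variance. Returns the combined estimate and a confidence interval.
    """
    m = len(point_estimates)
    if m < 2 or m != len(variance_estimates):
        raise ValueError("need at least two datasets and matching list lengths")

    q_bar = mean(point_estimates)      # combined point estimate
    b = variance(point_estimates)      # between-dataset variance (divides by m - 1)
    v_bar = mean(variance_estimates)   # average within-dataset variance

    # Variance estimate for fully synthetic data: (1 + 1/m) * b - v_bar.
    total_var = (1 + 1 / m) * b - v_bar
    if total_var <= 0:
        # This estimator can be non-positive. Falling back to v_bar is a
        # crude assumption of this sketch, not a recommended correction.
        total_var = v_bar

    # Normal approximation; a t-distribution is another common choice.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * total_var ** 0.5
    return q_bar, (q_bar - half_width, q_bar + half_width)


# Example with made-up numbers from three hypothetical synthetic datasets.
estimate, interval = combine_synthetic_estimates([1.0, 2.0, 3.0], [0.1, 0.1, 0.1])
```

Under these rules, disagreement between the synthetic datasets (a larger `b`) widens the interval. This matches the abstract's point that treating a single synthetic dataset as real data understates the uncertainty.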