While the generation of synthetic data under differential privacy (DP) has received a lot of attention in the data privacy community, the analysis of that synthetic data has received much less. Existing work has shown that simply analysing DP synthetic data as if it were real does not produce valid inferences of population-level quantities. For example, confidence intervals become too narrow, which we demonstrate with a simple experiment. We tackle this problem by combining synthetic data analysis techniques from the field of multiple imputation (MI) with synthetic data generation using noise-aware (NA) Bayesian modeling into a pipeline, NA+MI, that allows computing accurate uncertainty estimates for population-level quantities from DP synthetic data. To implement NA+MI for discrete data generation using the values of marginal queries, we develop a novel noise-aware synthetic data generation algorithm, NAPSU-MQ, using the principle of maximum entropy. Our experiments demonstrate that the pipeline is able to produce accurate confidence intervals from DP synthetic data. The intervals become wider with tighter privacy to accurately capture the additional uncertainty stemming from DP noise.
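To make the MI step of such a pipeline concrete, the following is a minimal sketch of how per-dataset results could be combined once several synthetic datasets have been generated and analysed separately. The abstract does not state which combining rules are used, so this sketch assumes the standard rules for fully synthetic data from the multiple imputation literature (total variance (1 + 1/m)·b − v̄) and, as a simplification, a normal quantile rather than a t-distribution. The function name `combine_synthetic_estimates` and its parameters are hypothetical and not taken from the paper.

```python
from statistics import NormalDist, mean, variance


def combine_synthetic_estimates(estimates, variances, level=0.95):
    """Combine point estimates from m synthetic datasets (hypothetical helper).

    estimates: point estimate of the quantity from each synthetic dataset
    variances: estimated variance of that estimate within each dataset
    Assumes the combining rules for fully synthetic data; the source
    abstract does not specify the exact rules used.
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two datasets and matching lengths")

    q_bar = mean(estimates)    # combined point estimate
    b = variance(estimates)    # between-dataset sample variance
    v_bar = mean(variances)    # average within-dataset variance

    # Total variance under the assumed fully synthetic combining rule.
    total = (1 + 1 / m) * b - v_bar
    if total <= 0:
        # The estimator can be non-positive for small m; this sketch does
        # not implement any of the published fallback adjustments.
        raise ValueError("non-positive variance estimate; try more datasets")

    # Simplification: normal approximation instead of a t-distribution.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * total ** 0.5
    return q_bar, total, (q_bar - half_width, q_bar + half_width)
```

For example, `combine_synthetic_estimates([1.0, 2.0, 3.0], [0.1, 0.1, 0.1])` returns the averaged estimate, the combined variance, and an approximate interval. In a noise-aware setting, the spread between the synthetic datasets (the `b` term) is what would carry the extra uncertainty from DP noise into the final interval.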