Marginal-based methods have achieved promising performance in the synthetic data competition hosted by the National Institute of Standards and Technology (NIST). To handle high-dimensional data, the distribution of the synthetic data is represented by a probabilistic graphical model (e.g., a Bayesian network), while the raw data distribution is approximated by a collection of low-dimensional marginals. Differential privacy (DP) is guaranteed by injecting random noise into each low-dimensional marginal distribution. Despite their promising performance in practice, the statistical properties of marginal-based methods are rarely studied in the literature. In this paper, we study DP data synthesis algorithms based on Bayesian networks (BN) from a statistical perspective. We establish a rigorous accuracy guarantee for BN-based algorithms, where the errors are measured by the total variation (TV) distance or the $L^2$ distance. Relating to downstream machine learning tasks, we also derive an upper bound on the utility error of the DP synthetic data. To complete the picture, we establish a lower bound on the TV accuracy that holds for every $\epsilon$-DP synthetic data generator.
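The noise-injection step described above can be illustrated with a minimal sketch of the Laplace mechanism applied to one low-dimensional marginal. This is not the paper's algorithm; it only shows the generic pattern (count histogram, Laplace noise with scale $1/\epsilon$ for a sensitivity-1 count query, then projection back to a probability distribution). The function name `noisy_marginal` and its parameters are illustrative assumptions.

```python
import numpy as np

def noisy_marginal(data, cols, domain_sizes, epsilon, rng=None):
    """Illustrative sketch: epsilon-DP release of one low-dimensional marginal.

    `data` is an (n, d) integer array of categorical records; `cols` selects
    the attributes forming the marginal. Adding or removing one record changes
    each cell count by at most 1, so Laplace noise with scale 1/epsilon
    satisfies epsilon-DP for this single marginal query.
    """
    rng = np.random.default_rng() if rng is None else rng
    shape = tuple(domain_sizes[c] for c in cols)
    # Count histogram over the selected attributes.
    flat = np.ravel_multi_index(tuple(data[:, c] for c in cols), shape)
    counts = np.bincount(flat, minlength=int(np.prod(shape))).astype(float)
    # Laplace mechanism: scale = sensitivity / epsilon = 1 / epsilon.
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.size)
    # Post-process into a valid distribution: clip negatives, renormalize.
    noisy = np.clip(noisy, 0.0, None)
    total = noisy.sum()
    if total <= 0:
        return np.full(shape, 1.0 / noisy.size)  # degenerate fallback: uniform
    return (noisy / total).reshape(shape)
```

In marginal-based synthesizers, many such noisy marginals are released (with the privacy budget split across them) and then stitched together by a graphical model such as a Bayesian network; the post-processing step above preserves DP because it touches only the noisy output.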