This work presents a systematic benchmark of differentially private algorithms for generating synthetic tabular data. We evaluate the utility of the synthetic data by measuring how well it preserves the distributions of individual attributes and of attribute pairs, the pairwise correlations between attributes, and the accuracy of an ML classification model. In a comprehensive empirical evaluation, we identify the top-performing algorithms and those that consistently fail to beat baseline approaches.
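The distribution and correlation measures described above can be illustrated with a minimal sketch. The abstract does not say which distance or correlation statistic the benchmark uses, so total variation distance and Pearson correlation are assumptions here, as are the function names `marginal_tvd`, `pearson`, and `utility_report` and the row-as-dict data layout. The classifier-accuracy measure is omitted because it depends on a model and train/test protocol that the abstract does not specify.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def marginal_tvd(real, synth, cols):
    """Total variation distance between the empirical joint distributions
    of `cols` in two datasets (assumed metric, not necessarily the paper's)."""
    def dist(rows):
        counts = Counter(tuple(r[c] for c in cols) for r in rows)
        n = len(rows)
        return {k: v / n for k, v in counts.items()}

    p, q = dist(real), dist(synth)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def utility_report(real, synth, categorical, numeric):
    """Hypothetical helper: per-attribute and per-pair distribution gaps,
    plus the absolute gap in pairwise correlation for numeric attributes."""
    one_way = {c: marginal_tvd(real, synth, [c]) for c in categorical}
    two_way = {
        pair: marginal_tvd(real, synth, list(pair))
        for pair in combinations(categorical, 2)
    }
    corr_gap = {}
    for a, b in combinations(numeric, 2):
        r = pearson([row[a] for row in real], [row[b] for row in real])
        s = pearson([row[a] for row in synth], [row[b] for row in synth])
        corr_gap[(a, b)] = abs(r - s)
    return {"one_way_tvd": one_way, "two_way_tvd": two_way, "corr_gap": corr_gap}
```

Each dataset is passed as a list of dictionaries, one per row. Lower values mean the synthetic data tracks the real data more closely on that measure, and a value of 0 means the two datasets agree exactly on it.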