AI-based data synthesis has seen rapid progress over the last several years and is increasingly recognized for its promise to enable privacy-respecting, high-fidelity data sharing. However, adequately evaluating the quality of generated synthetic datasets is still an open challenge. We introduce and demonstrate a holdout-based empirical assessment framework for quantifying both the fidelity and the privacy risk of synthetic data solutions for mixed-type tabular data. Fidelity is measured via statistical distances of lower-dimensional marginal distributions, which provide a model-free and easy-to-communicate empirical metric for the representativeness of a synthetic dataset. Privacy risk is assessed by calculating the individual-level distances of synthetic records to their closest records in the training data. By showing that the synthetic samples are just as close to the training data as to the holdout data, we obtain strong evidence that the synthesizer indeed learned to generalize patterns and is independent of individual training records. We demonstrate the presented framework for seven distinct synthetic data solutions across four mixed-type datasets and compare these to more traditional statistical disclosure techniques. The results highlight the need to systematically assess the fidelity just as much as the privacy of this emerging class of synthetic data generators.
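The distance-to-closest-record (DCR) comparison described above can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes purely numeric data and plain Euclidean distance, whereas the framework targets mixed-type tabular data, and the random inputs here are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
# placeholder data: in practice these would be the real training set,
# the held-out evaluation set, and the generated synthetic set
train = rng.normal(size=(200, 3))
holdout = rng.normal(size=(200, 3))
synthetic = rng.normal(size=(200, 3))

def dcr(queries, reference):
    # distance to closest record: for each query row, the minimal
    # Euclidean distance to any row of the reference set
    d = np.linalg.norm(queries[:, None, :] - reference[None, :, :], axis=-1)
    return d.min(axis=1)

dcr_train = dcr(synthetic, train)
dcr_holdout = dcr(synthetic, holdout)

# privacy signal: synthetic records should be no closer to the training
# records than to the holdout records (compare e.g. the medians)
print(np.median(dcr_train), np.median(dcr_holdout))
```

If synthetic records were systematically closer to training records than to holdout records, that would indicate memorization of individual training records rather than generalization.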