Most statistical agencies release randomly selected samples of Census microdata, usually with sample fractions under 10% and with other forms of statistical disclosure control (SDC) applied. An alternative to SDC is data synthesis, which has been attracting growing interest, yet there is no clear consensus on how to measure the utility and disclosure risk of synthetic data. The ability to produce synthetic Census microdata whose utility and associated risks are clearly understood could allow more timely and wider-ranging access to microdata. This paper follows on from previous work by the authors, which mapped synthetic Census data on a risk-utility (R-U) map. It presents a framework that measures the utility and disclosure risk of synthetic data by comparing the synthetic data with samples of the original data drawn at varying sample fractions, thereby identifying the sample fraction whose utility and risk are equivalent to those of the synthetic data. Three commonly used data synthesis packages are compared, with some interesting results. Further work is needed in several directions, but the methodology looks very promising.
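The comparison at the heart of the framework can be illustrated with a minimal sketch. It is not the authors' implementation: the utility measure (one minus the total variation distance between category frequencies of a single variable), the toy population, the stand-in "synthetic" data, and the function names `tvd_utility`, `mean_sample_utility` and `equivalent_fraction` are all assumptions made for illustration. The same loop could in principle be run with a disclosure risk measure in place of the utility measure.

```python
import random
from collections import Counter


def tvd_utility(original, candidate):
    """Assumed toy utility: 1 - total variation distance between the
    category frequencies of the original and the candidate data."""
    p, q = Counter(original), Counter(candidate)
    n_p, n_q = len(original), len(candidate)
    categories = set(p) | set(q)
    tvd = 0.5 * sum(abs(p[c] / n_p - q[c] / n_q) for c in categories)
    return 1.0 - tvd


def mean_sample_utility(original, fraction, rng, repeats=20):
    """Average utility of simple random samples drawn at one sample fraction."""
    k = max(1, round(fraction * len(original)))
    scores = [tvd_utility(original, rng.sample(original, k)) for _ in range(repeats)]
    return sum(scores) / repeats


def equivalent_fraction(original, synthetic, fractions, rng):
    """Return the sample fraction whose mean utility is closest to the
    utility of the synthetic data, plus the target and the full curve."""
    target = tvd_utility(original, synthetic)
    curve = {f: mean_sample_utility(original, f, rng) for f in fractions}
    best = min(fractions, key=lambda f: abs(curve[f] - target))
    return best, target, curve


# Hypothetical data: one categorical variable for 5,000 records.
rng = random.Random(0)
weights = [5, 4, 3, 2, 1]
original = rng.choices("ABCDE", weights=weights, k=5000)
# Stand-in for a synthesiser: a fresh draw from the same marginal distribution.
synthetic = rng.choices("ABCDE", weights=weights, k=5000)

fractions = [0.001, 0.005, 0.01, 0.05, 0.1]
best, target, curve = equivalent_fraction(original, synthetic, fractions, rng)
print(f"synthetic utility: {target:.4f}")
for f in fractions:
    print(f"fraction {f:>5}: mean sample utility {curve[f]:.4f}")
print(f"closest sample fraction: {best}")
```

A single-variable frequency comparison is used here only to keep the sketch short; any utility or risk measure that can be evaluated on both a sample and a synthetic dataset could be substituted for `tvd_utility`.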