Releasing synthetic data generated from a model estimated on the confidential data helps statistical agencies disseminate respondent-level data with high utility while protecting privacy. Motivated by the challenge of disseminating sensitive variables containing geographic information in the Consumer Expenditure Surveys (CE) at the U.S. Bureau of Labor Statistics, we propose two non-parametric Bayesian models as data synthesizers for the county identifier of each data record: a Bayesian latent class model and a Bayesian areal model. Both synthesizers use Dirichlet Process priors to cluster observations with similar characteristics and to borrow information across observations. We develop innovative disclosure risk measures to quantify the inherent risks in the confidential CE data and how those risks are ameliorated by our proposed synthesizers. By establishing a lower bound and an upper bound on disclosure risk under minimum and maximum disclosure risk scenarios, respectively, our proposed inherent risk measures provide a range of acceptable disclosure risks for evaluating the risk level of the synthetic datasets.