We propose a general approach to evaluating the identification risk of continuous synthesized variables in partially synthetic data. We introduce a radius $r$ into the construction of each target record's identification risk probability, and illustrate the approach with worked examples for one or more continuous synthesized variables. We demonstrate our methods on a data sample from the Consumer Expenditure Surveys (CE), and discuss how risk and data utility are affected by 1) the choice of radius $r$, 2) the choice of synthesized variables, and 3) the number of synthetic datasets. We give recommendations to statistical agencies on synthesizing continuous variables and evaluating their identification risk. We provide an R package that implements the proposed identification risk evaluation methods, along with sample R scripts.