Until recently, multiple synthetic data sets were always released to analysts so that valid inferences could be obtained. However, under certain conditions, including when saturated count models are used to synthesize categorical data, single imputation ($m=1$) is sufficient. Nevertheless, increasing $m$ improves utility, but at the expense of higher risk, an example of the risk-utility trade-off. The question, therefore, is: which value of $m$ is optimal with respect to the risk-utility trade-off? Moreover, the paper considers two ways of analysing multiple categorical data sets: because they have a contingency table representation, the data sets can be averaged before being analysed, as opposed to the usual way of averaging post-analysis. This paper also introduces a pair of metrics, $\tau_3(k,d)$ and $\tau_4(k,d)$, that are suited to assessing disclosure risk in multiple categorical synthetic data sets. Finally, the synthesis methods are demonstrated empirically.
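
The difference between averaging before analysis and averaging post-analysis can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than taken from the paper: the three 2x2 tables hold made-up counts, the odds ratio is just one convenient non-linear estimator, and the function names are hypothetical.

```python
import numpy as np

def odds_ratio(table):
    """Odds ratio of a 2x2 table of (possibly non-integer) counts."""
    t = np.asarray(table, dtype=float)
    return (t[0, 0] * t[1, 1]) / (t[0, 1] * t[1, 0])

def analyse_then_average(tables, estimator):
    """Usual route: apply the estimator to each table, then average the estimates."""
    return float(np.mean([estimator(t) for t in tables]))

def average_then_analyse(tables, estimator):
    """Alternative route: average the tables cell by cell, then apply the estimator once."""
    mean_table = np.mean(np.stack(tables), axis=0)
    return float(estimator(mean_table))

# Illustrative (made-up) synthetic contingency tables, m = 3.
synthetic_tables = [
    np.array([[10, 5], [4, 8]]),
    np.array([[12, 4], [5, 9]]),
    np.array([[9, 6], [3, 10]]),
]

post = analyse_then_average(synthetic_tables, odds_ratio)
pre = average_then_analyse(synthetic_tables, odds_ratio)
print(f"average post-analysis:   {post:.2f}")
print(f"average before analysis: {pre:.2f}")
```

For a non-linear estimator such as this one, the two routes generally give different answers, which is why the order of averaging is a choice that has to be made.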