Performing randomized response (RR) over multi-dimensional data is subject to the curse of dimensionality. As the number of attributes increases, the number of attribute-value combinations grows exponentially, which greatly increases the computational cost and degrades the accuracy of the RR estimates. In this paper, we propose a new multi-dimensional RR scheme that randomizes all attributes independently and then aggregates the per-attribute randomization matrices into a single aggregated matrix. The multi-dimensional joint probability distributions are then estimated using this aggregated matrix. The inverse of the aggregated randomization matrix can be computed efficiently, at a computational cost that is linear with respect to dimensionality and with manageable storage requirements. To overcome the limitation of accuracy, we propose two extensions to the baseline protocol, called the {\em hybrid} and {\em truncated} schemes. Finally, we conducted experiments using synthetic and major open-source datasets for various numbers of attributes, domain sizes, and numbers of respondents. On the UCI Adult dataset, the average distances between the estimated and the real (2- through 6-way) joint probabilities are $0.0099$ for the {\em truncated} scheme and $0.0155$ for the {\em hybrid} scheme, whereas they are $0.03$ and $0.04$ for LoPub, which is the state-of-the-art multi-dimensional LDP scheme.
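The aggregation idea can be illustrated with a minimal sketch. It assumes, as an illustration rather than as the paper's exact construction, that each attribute uses a standard $k$-ary RR matrix with retention probability $p$, and that the aggregated matrix is the Kronecker product of the per-attribute matrices, which is the joint transition matrix whenever attributes are randomized independently. Under that assumption the inverse of the aggregated matrix is the Kronecker product of the per-attribute inverses, so only small matrices ever need to be inverted. The helper names (`rr_matrix`, `rr_inverse`, `kron`, `matvec`) and the toy distribution are hypothetical.

```python
def rr_matrix(d, p):
    """d x d k-ary RR matrix (assumed form): keep the true value with
    probability p, otherwise report one of the other d-1 values uniformly."""
    q = (1.0 - p) / (d - 1)
    return [[p if i == j else q for j in range(d)] for i in range(d)]


def rr_inverse(d, p):
    """Closed-form inverse of rr_matrix(d, p); requires p != 1/d.
    With P = a*I + q*J and a + d*q = 1, the inverse is (I - q*J) / a."""
    q = (1.0 - p) / (d - 1)
    a = p - q
    return [[(1.0 - q) / a if i == j else -q / a for j in range(d)]
            for i in range(d)]


def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in A for row_b in B]


def matvec(M, v):
    """Matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


# Toy joint distribution over two attributes with domain sizes 2 and 3,
# flattened in row-major order (hypothetical numbers).
true_joint = [0.10, 0.20, 0.10, 0.25, 0.05, 0.30]

# Each attribute is randomized independently with its own matrix.
P1, P2 = rr_matrix(2, 0.8), rr_matrix(3, 0.7)

# Expected distribution of the randomized reports.
observed = matvec(kron(P1, P2), true_joint)

# Inverse of the aggregated matrix, built from per-attribute inverses only.
aggregated_inverse = kron(rr_inverse(2, 0.8), rr_inverse(3, 0.7))
estimate = matvec(aggregated_inverse, observed)
```

The explicit Kronecker product is formed here only to keep the check short; its size still grows exponentially with the number of attributes. The point of the sketch is that the inversion step touches one small matrix per attribute, so its cost grows linearly with the number of attributes. With a finite number of respondents, `observed` would be an empirical frequency vector and `estimate` would be noisy rather than exact.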