Privacy-protected microdata are often the desired output of a differentially private algorithm, since microdata are familiar and convenient for downstream users. However, there is a statistical price for this kind of convenience. We show that an uncertainty principle governs the trade-off between accuracy for a population of interest ("sum query") and accuracy for its component sub-populations ("point queries"). Compared to differentially private query answering systems that are not required to produce microdata, accuracy can degrade by a logarithmic factor. For example, in the case of pure differential privacy, without the microdata requirement, one can provide noisy answers to the sum query and all point queries while guaranteeing that each answer has squared error $O(1/\epsilon^2)$. With the microdata requirement, one must either allow an additional $\log^2(d)$ factor ($d$ is the number of point queries) for some point queries or allow an extra $O(d^2)$ factor for the sum query. We present lower bounds for pure, approximate, and concentrated differential privacy. We propose mitigation strategies and create a collection of benchmark datasets that can be used for public study of this problem.
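To make the non-microdata baseline concrete, the following is a minimal sketch (not the paper's algorithm) of answering the $d$ point queries and the sum query directly with Laplace noise. Because the point queries count disjoint sub-populations, adding or removing one record changes exactly one point count by 1, and the sum query also has sensitivity 1; splitting the budget as $\epsilon/2 + \epsilon/2$ yields pure $\epsilon$-differential privacy with squared error $O(1/\epsilon^2)$ on every answer. The function and parameter names here are illustrative assumptions, not identifiers from the paper.

```python
import numpy as np

def answer_queries(point_counts, epsilon, rng=None):
    """Noisy answers to d disjoint point queries and their sum (illustrative sketch).

    point_counts: array of true counts for the d disjoint sub-populations.
    epsilon: total pure-DP budget, split evenly between the point queries
             (released together via parallel composition) and the sum query.
    """
    rng = np.random.default_rng() if rng is None else rng
    scale = 2.0 / epsilon  # Laplace scale for each half of the budget (sensitivity 1)
    noisy_points = point_counts + rng.laplace(0.0, scale, size=len(point_counts))
    noisy_sum = point_counts.sum() + rng.laplace(0.0, scale)
    # Each answer has variance 2 * scale^2 = 8 / epsilon^2, i.e. O(1/epsilon^2).
    return noisy_points, noisy_sum

# Example with d = 10 sub-populations and epsilon = 1
counts = np.array([120, 35, 0, 4, 250, 18, 7, 61, 3, 90], dtype=float)
noisy_points, noisy_sum = answer_queries(counts, epsilon=1.0)
```

Note that these noisy answers need not be mutually consistent (the noisy sum generally differs from the sum of the noisy point answers); forcing consistency by releasing microdata is exactly where the extra $\log^2(d)$ or $O(d^2)$ factors in the abstract arise.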