Privacy-protecting data analysis studies statistical methods under privacy constraints. This is a growing challenge in modern statistics, because confidentiality guarantees are typically achieved through suitable perturbations of the data, and such perturbations may reduce the statistical utility of the data. In this paper, we consider privacy-protecting tests for goodness-of-fit in frequency tables, arguably the most common form in which data are released. Under the popular framework of $(\varepsilon,\delta)$-differential privacy for perturbed data, we introduce a private likelihood-ratio (LR) test for goodness-of-fit and study its large-sample properties, showing the importance of taking the perturbation into account to avoid a loss in the statistical significance of the test. Our main contribution is a quantitative characterization of the trade-off between confidentiality, measured via the differential privacy parameters $\varepsilon$ and $\delta$, and utility, measured via the power of the test. In particular, we establish a precise Bahadur-Rao type large deviation expansion for the power of the private LR test, which allows us to: i) identify a critical quantity, as a function of the sample size and $(\varepsilon,\delta)$, that determines a loss in the power of the private LR test; ii) quantify the sample cost of $(\varepsilon,\delta)$-differential privacy in the private LR test, namely the additional sample size required to recover the power of the LR test in the absence of perturbation. This result relies on a novel multidimensional large deviation principle for sums of i.i.d. random vectors, which is of independent interest. Our work presents the first rigorous treatment of privacy-protecting LR tests for goodness-of-fit in frequency tables, using the power of the test to quantify the trade-off between confidentiality and utility.
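To make the setting concrete, the following is a minimal sketch of what can go wrong when a perturbed frequency table is tested as if it were exact. It is not the private LR test introduced in the paper. The abstract does not specify the perturbation mechanism, so the sketch assumes independent Laplace noise on each cell with scale `sensitivity / epsilon`, which gives $\varepsilon$-differential privacy (the case $\delta = 0$), and an assumed sensitivity of 2 (replacing one record changes two cells by one each). The function names `lr_statistic`, `laplace_perturb` and `simulate_null_means` are hypothetical, and the statistic is the classical LR statistic $2\sum_i n_i \log(n_i / (n p_i))$ applied naively to the noisy counts.

```python
import numpy as np


def lr_statistic(counts, probs):
    """Classical LR statistic 2 * sum_i n_i * log(n_i / (n * p_i))."""
    counts = np.asarray(counts, dtype=float)
    probs = np.asarray(probs, dtype=float)
    n = counts.sum()
    expected = n * probs
    mask = counts > 0  # empty cells contribute zero by convention
    return 2.0 * np.sum(counts[mask] * np.log(counts[mask] / expected[mask]))


def laplace_perturb(counts, epsilon, rng, sensitivity=2.0):
    """Assumed mechanism: add Laplace(sensitivity / epsilon) noise to each cell.

    Negative noisy counts are clipped to zero so that the naive statistic
    stays well defined; clipping is post-processing of the released table.
    """
    counts = np.asarray(counts, dtype=float)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=counts.shape)
    return np.clip(counts + noise, 0.0, None)


def simulate_null_means(n=1000, probs=(0.2, 0.2, 0.2, 0.2, 0.2),
                        epsilon=0.2, reps=2000, seed=0):
    """Average the naive statistic under the null, on exact and noisy tables."""
    rng = np.random.default_rng(seed)
    probs = np.asarray(probs, dtype=float)
    raw, noisy = [], []
    for _ in range(reps):
        counts = rng.multinomial(n, probs)  # table drawn under the null
        raw.append(lr_statistic(counts, probs))
        noisy.append(lr_statistic(laplace_perturb(counts, epsilon, rng), probs))
    return float(np.mean(raw)), float(np.mean(noisy))


if __name__ == "__main__":
    raw_mean, noisy_mean = simulate_null_means()
    print(f"mean LR statistic, exact table: {raw_mean:.2f}")
    print(f"mean LR statistic, noisy table: {noisy_mean:.2f}")
```

In this sketch the noise adds variance to every cell, so the naive statistic on the perturbed table tends to be larger under the null than the statistic on the exact table. Comparing it with the usual chi-square critical value would therefore reject too often, which is one way to read the abstract's remark on taking the perturbation into account.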