Differential privacy allows bounding the influence that training data records have on a machine learning model. To use differential privacy in machine learning, data scientists must choose privacy parameters $(\epsilon,\delta)$. Choosing meaningful privacy parameters is key, since models trained with weak privacy parameters might result in excessive privacy leakage, while strong privacy parameters might overly degrade model utility. However, privacy parameter values are difficult to choose for two main reasons. First, the upper bound on privacy loss $(\epsilon,\delta)$ might be loose, depending on the chosen sensitivity and the data distribution of practical datasets. Second, legal requirements and societal norms for anonymization often refer to individual identifiability, to which $(\epsilon,\delta)$ are only indirectly related. We transform $(\epsilon,\delta)$ into a bound on the Bayesian posterior belief of the adversary assumed by differential privacy about the presence of any record in the training dataset. The bound holds for multidimensional queries under composition, and we show that it can be tight in practice. Furthermore, we derive an identifiability bound, which relates the adversary assumed in differential privacy to previous work on membership inference adversaries. We formulate an implementation of this differential privacy adversary that allows data scientists to audit model training and compute empirical identifiability scores and empirical $(\epsilon,\delta)$.
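To give a concrete sense of the transformation from $(\epsilon,\delta)$ to a posterior belief, the sketch below illustrates only the simplest, well-known special case: pure $\epsilon$-DP, a single query, and a prior $p$ on the target record's presence, where the likelihood-ratio bound $e^{\epsilon}$ yields a posterior of at most $p e^{\epsilon} / (p e^{\epsilon} + 1 - p)$. This is an illustrative assumption on our part (the function name `posterior_belief_bound` is ours), not the paper's bound, which additionally accounts for $\delta$, multidimensional queries, and composition.

```python
import math


def posterior_belief_bound(epsilon: float, prior: float = 0.5) -> float:
    """Illustrative upper bound on the DP adversary's posterior belief that a
    target record is in the training data, for pure epsilon-DP and a single
    query. Uses the standard likelihood-ratio bound e^epsilon; the bound in
    the abstract also covers delta, multidimensional queries, and composition.
    """
    likelihood_ratio = math.exp(epsilon)
    return (prior * likelihood_ratio) / (prior * likelihood_ratio + (1.0 - prior))


if __name__ == "__main__":
    # With a uniform prior (0.5), the bound reduces to e^eps / (1 + e^eps).
    for eps in (0.1, 1.0, 3.0, 8.0):
        print(f"epsilon={eps:>4}: posterior belief <= {posterior_belief_bound(eps):.4f}")
```

Under this simplified reading, small $\epsilon$ keeps the adversary's posterior close to the prior (e.g. $\epsilon=0.1$ gives roughly 0.52 for a uniform prior), whereas large $\epsilon$ lets it approach 1, which is one way to connect $(\epsilon,\delta)$ to the identifiability notions referenced by legal requirements.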