Differential privacy allows bounding the influence that training data records have on a machine learning model. To use differential privacy in machine learning, data scientists must choose privacy parameters $(\epsilon,\delta)$. Choosing meaningful privacy parameters is key, since models trained with weak privacy parameters might result in excessive privacy leakage, while strong privacy parameters might overly degrade model utility. However, privacy parameter values are difficult to choose for two main reasons. First, the theoretical upper bound on privacy loss $(\epsilon,\delta)$ might be loose, depending on the chosen sensitivity and the data distribution of practical datasets. Second, legal requirements and societal norms for anonymization often refer to individual identifiability, to which $(\epsilon,\delta)$ are only indirectly related. We transform $(\epsilon,\delta)$ into a bound on the Bayesian posterior belief that the adversary assumed by differential privacy holds about the presence of any record in the training dataset. The bound holds for multidimensional queries under composition, and we show that it can be tight in practice. Furthermore, we derive an identifiability bound, which relates the adversary assumed in differential privacy to previous work on membership inference adversaries. We formulate an implementation of this differential privacy adversary that allows data scientists to audit model training and compute empirical identifiability scores and empirical $(\epsilon,\delta)$.
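As a minimal illustration of how a privacy parameter translates into a posterior-belief bound, consider the pure $\epsilon$-DP case ($\delta=0$) with a single query and a uniform prior of $1/2$ over the two neighboring datasets $D_0$ (record absent) and $D_1$ (record present); the multidimensional, composed $(\epsilon,\delta)$ setting addressed in the paper is more involved. By Bayes' theorem and the $\epsilon$-DP guarantee $P(o \mid D_1) \le e^{\epsilon}\,P(o \mid D_0)$,
\[
P(D_1 \mid o) = \frac{P(o \mid D_1)}{P(o \mid D_0) + P(o \mid D_1)} \;\le\; \frac{e^{\epsilon}\,P(o \mid D_0)}{P(o \mid D_0) + e^{\epsilon}\,P(o \mid D_0)} \;=\; \frac{e^{\epsilon}}{1 + e^{\epsilon}},
\]
so, for example, $\epsilon = 1$ caps the adversary's posterior belief at roughly $0.73$.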