Differential privacy allows bounding the influence that training data records have on a machine learning model. To use differential privacy in machine learning, data scientists must choose privacy parameters $(\epsilon,\delta)$. Choosing meaningful privacy parameters is key, since models trained with weak privacy parameters might result in excessive privacy leakage, while strong privacy parameters might overly degrade model utility. However, privacy parameter values are difficult to choose for two main reasons. First, the upper bound on privacy loss $(\epsilon,\delta)$ might be loose, depending on the chosen sensitivity and the data distribution of practical datasets. Second, legal requirements and societal norms for anonymization often refer to individual identifiability, to which $(\epsilon,\delta)$ are only indirectly related. Prior work has proposed membership inference adversaries to guide the choice of $(\epsilon,\delta)$. However, these adversaries are weaker than the adversary assumed by differential privacy and cannot empirically reach the upper bounds on privacy loss defined by $(\epsilon,\delta)$. Therefore, no quantification of a membership inference attack carries the exact meaning that $(\epsilon,\delta)$ does. We transform $(\epsilon,\delta)$ into a bound on the Bayesian posterior belief of the adversary assumed by differential privacy concerning the presence of any record in the training dataset. The bound holds for multidimensional queries under composition, and we show that it can be tight in practice. Furthermore, we derive an identifiability bound, which relates the adversary assumed in differential privacy to previous work on membership inference adversaries. We formulate an implementation of this differential privacy adversary that allows data scientists to audit model training and compute empirical identifiability scores and empirical $(\epsilon,\delta)$.
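As a minimal illustration of how $(\epsilon,\delta)$ translates into a posterior belief, the sketch below computes the classical bound for pure $\epsilon$-differential privacy: since the likelihood ratio between neighboring datasets is at most $e^{\epsilon}$, a Bayesian adversary's posterior on a record's membership is at most $\rho\,e^{\epsilon} / (\rho\,e^{\epsilon} + 1 - \rho)$ for prior $\rho$. This is a simplified special case (it ignores $\delta$ and composition, which the paper's bound handles), not the paper's actual derivation.

```python
import math

def posterior_bound(epsilon: float, prior: float = 0.5) -> float:
    """Upper bound on the Bayesian posterior belief that a record is in
    the training set, under pure epsilon-DP and a given prior.

    Follows from Bayes' rule with the likelihood ratio capped at e^epsilon:
        posterior <= prior * e^eps / (prior * e^eps + (1 - prior))
    Simplified illustration only: delta and composition are not modeled.
    """
    weighted = prior * math.exp(epsilon)
    return weighted / (weighted + (1.0 - prior))

# With a uniform prior, epsilon = 0 gives no information gain (posterior 0.5),
# while larger epsilon pushes the adversary's certainty toward 1.
print(posterior_bound(0.0))  # 0.5
print(posterior_bound(1.0))  # e / (1 + e) ~ 0.731
```

With a uniform prior, an often-cited rule of thumb follows directly: $\epsilon = 1$ already allows the adversary's belief to rise from 50% to about 73%, which makes concrete why "weak" privacy parameters can mean substantial identifiability.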