Privacy and transparency are two key foundations of trustworthy machine learning. Model explanations offer insights into a model's decisions on input data, whereas privacy is primarily concerned with protecting information about the training data. We analyze connections between model explanations and the leakage of sensitive information about the model's training set. We investigate the privacy risks of feature-based model explanations using membership inference attacks: quantifying how much model predictions together with their explanations leak information about the presence of a datapoint in the training set of a model. We extensively evaluate membership inference attacks that exploit feature-based model explanations, across a variety of datasets. We show that backpropagation-based explanations can leak a significant amount of information about individual training datapoints. This is because they reveal statistical information about the decision boundaries of the model around an input, which can reveal its membership. We also empirically investigate the trade-off between privacy and explanation quality, by studying perturbation-based model explanations.
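To make the attack surface concrete, the following is a minimal sketch of a membership inference attack that thresholds a statistic of a backpropagation-based explanation. It assumes a PyTorch classifier; the use of explanation variance as the attack feature follows the intuition above (explanations vary more near decision boundaries), but the function names, the comparison direction, and the threshold are illustrative assumptions and would be calibrated empirically, e.g. on shadow data.

```python
import torch

def gradient_explanation(model, x, target):
    """Backpropagation-based (saliency-style) explanation for one input:
    the gradient of the target logit with respect to the input features.
    `model` is assumed to map a batch of inputs to logits."""
    x = x.clone().detach().requires_grad_(True)
    logits = model(x.unsqueeze(0))
    logits[0, target].backward()
    return x.grad.detach().flatten()

def infer_membership(model, x, target, threshold):
    """Hypothetical threshold attack: treat low variance of the
    explanation as a signal that x lies far from a decision boundary,
    and hence guess that x was a training member. The threshold (and
    the direction of the comparison) would be tuned on shadow models."""
    expl_var = gradient_explanation(model, x, target).var().item()
    return expl_var < threshold
```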