Model explanations provide a model builder with transparency into a trained machine learning model's black-box behavior: they indicate the influence of different input attributes on the corresponding model prediction. Because explanations depend on the input, they raise privacy concerns for sensitive user data, yet the current literature offers limited discussion of the privacy risks of model explanations. We focus on one specific privacy risk, the attribute inference attack, in which an adversary infers sensitive attributes of an input (e.g., race and sex) given its model explanations. We design the first attribute inference attack against model explanations under two threat models, in which the model builder either (a) includes the sensitive attributes in the training data and input or (b) censors the sensitive attributes by excluding them from the training data and input. We evaluate our attack on four benchmark datasets and four state-of-the-art explanation algorithms, and show that an adversary can accurately infer the values of sensitive attributes from explanations under both threat models. Moreover, the attack succeeds even when it exploits only the explanations corresponding to the sensitive attributes. These results indicate that our attack is effective against explanations and poses a practical threat to data privacy. Combining model predictions (the attack surface exploited by prior attacks) with explanations does not improve attack success, while exploiting model explanations alone outperforms exploiting model predictions alone. This suggests that model explanations are a strong attack surface for an adversary.
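To make the attack surface concrete, the following is a minimal sketch of an attribute inference attack of this kind, not the paper's implementation. It assumes a synthetic tabular dataset, a logistic-regression target model, and simple input-times-gradient attributions standing in for the benchmark datasets and explanation algorithms; all names and the data-generating setup are hypothetical. The adversary trains an attack classifier that maps released explanation vectors to the sensitive attribute, and a predictions-only baseline mirrors the attack surface of prior work.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 5000, 8
X = rng.normal(size=(n, d))
# Hypothetical sensitive attribute (e.g., sex), included in the input
# as in threat model (a); labels are correlated with it.
sensitive = (X[:, 0] > 0).astype(int)
y = ((X @ rng.normal(size=d) + sensitive) > 0).astype(int)

# Target model trained by the model builder on data containing the
# sensitive attribute.
target = LogisticRegression(max_iter=1000).fit(X, y)

# For a linear model, input-times-gradient attributions reduce to
# w_j * x_j per feature; these play the role of the explanations
# released to the adversary for each queried input.
explanations = X * target.coef_[0]

# Attack: learn a mapping from explanation vectors to the sensitive
# attribute value.
E_tr, E_te, s_tr, s_te = train_test_split(explanations, sensitive, random_state=0)
attack = RandomForestClassifier(n_estimators=100, random_state=0).fit(E_tr, s_tr)
print("attack accuracy from explanations:", attack.score(E_te, s_te))

# Baseline: the same attack using only model predictions, the surface
# exploited by prior attribute inference attacks.
preds = target.predict_proba(X)
P_tr, P_te, s2_tr, s2_te = train_test_split(preds, sensitive, random_state=0)
baseline = RandomForestClassifier(n_estimators=100, random_state=0).fit(P_tr, s2_tr)
print("attack accuracy from predictions:", baseline.score(P_te, s2_te))

In this toy setup the explanation for the sensitive feature directly encodes its value, so the explanation-based attack recovers it near-perfectly while the predictions-only baseline does not, illustrating why explanations can be the stronger attack surface.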