As machine learning models are increasingly used in critical decision-making settings (e.g., healthcare, finance), there has been a growing emphasis on developing methods to explain model predictions. Such \textit{explanations} are used to understand and establish trust in models and are vital components of machine learning pipelines. Though explanations are a critical piece of these systems, there is little understanding of how vulnerable they are to manipulation by adversaries. In this paper, we discuss how two broad classes of explanations are vulnerable to manipulation. We demonstrate how adversaries can design biased models that manipulate model-agnostic feature attribution methods (e.g., LIME \& SHAP) and counterfactual explanations that hill-climb during the counterfactual search (e.g., Wachter's Algorithm \& DiCE) into \textit{concealing} the model's biases. These vulnerabilities allow an adversary to deploy a biased model whose explanations do not reveal the bias, thereby deceiving stakeholders into trusting the model. We evaluate these manipulations on real-world data sets, including COMPAS and Communities \& Crime, and find that explanations can be manipulated in practice.
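As a concrete illustration of the first class of attack, the sketch below shows one way a biased model could be scaffolded to evade perturbation-based attribution methods such as LIME and SHAP: an out-of-distribution detector routes explainer-generated perturbations to an innocuous decision rule, while real inputs receive the biased rule. This is a minimal sketch under stated assumptions, not necessarily the paper's exact construction; the class name \texttt{AdversarialModel}, the helper rules, and the Gaussian perturbation scheme used to train the detector are illustrative choices.

\begin{verbatim}
# Illustrative sketch of an adversarial "scaffolded" classifier (names and
# perturbation scheme are hypothetical, not the paper's exact construction).
# On in-distribution inputs it uses a rule that depends on a sensitive feature;
# on inputs that look like explainer perturbations it switches to an innocuous
# rule, so perturbation-based attributions conceal the bias.

import numpy as np
from sklearn.ensemble import RandomForestClassifier


class AdversarialModel:
    def __init__(self, X_train):
        X_train = np.asarray(X_train, dtype=float)
        # Train a detector separating real data (label 0) from LIME/SHAP-style
        # Gaussian perturbations of that data (label 1).
        perturbed = X_train + np.random.normal(0.0, 1.0, X_train.shape)
        X_det = np.vstack([X_train, perturbed])
        y_det = np.concatenate([np.zeros(len(X_train)), np.ones(len(perturbed))])
        self.ood_detector = RandomForestClassifier(n_estimators=100).fit(X_det, y_det)

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        on_manifold = self.ood_detector.predict(X) == 0
        return np.where(
            on_manifold,
            self._biased_rule(X),      # real inputs: depends on column 0 (the "sensitive" feature)
            self._innocuous_rule(X),   # explainer perturbations: ignores the sensitive feature
        ).astype(int)

    @staticmethod
    def _biased_rule(X):
        return (X[:, 0] > 0).astype(int)

    @staticmethod
    def _innocuous_rule(X):
        return (X[:, 1] > 0).astype(int)
\end{verbatim}

Run against an explainer that samples perturbations around an input (as LIME and SHAP's kernel explainer do), a model of this form would tend to attribute its predictions to the innocuous feature, even though its behavior on deployed, in-distribution inputs depends on the sensitive one.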