As machine learning models are increasingly used in critical decision-making settings (e.g., healthcare, finance), there has been a growing emphasis on developing methods to explain model predictions. Such \textit{explanations} are used to understand and establish trust in models and are vital components of machine learning pipelines. Though explanations are a critical piece of these systems, there is little understanding of how vulnerable they are to manipulation by adversaries. In this paper, we discuss how two broad classes of explanations are vulnerable to manipulation. We demonstrate how adversaries can design biased models that manipulate model-agnostic feature attribution methods (e.g., LIME \& SHAP) and counterfactual explanations that hill-climb during the counterfactual search (e.g., Wachter's Algorithm \& DiCE) into \textit{concealing} the model's biases. These vulnerabilities allow an adversary to deploy a biased model whose explanations do not reveal the bias, thereby deceiving stakeholders into trusting the model. We evaluate these manipulations on real-world data sets, including COMPAS and Communities \& Crime, and find that explanations can be manipulated in practice.
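As a concrete illustration of the first class of attack, the sketch below shows one way a biased model could be scaffolded to evade perturbation-based attribution methods such as LIME and SHAP: an out-of-distribution detector routes explainer-generated perturbations to an innocuous decision rule, while real inputs receive the biased rule. This is a minimal sketch under stated assumptions, not necessarily the paper's exact construction; the class name \texttt{AdversarialModel}, the helper rules, and the Gaussian perturbation scheme used to train the detector are illustrative choices.

\begin{verbatim}
# Illustrative sketch of an adversarial "scaffolded" classifier (names and
# perturbation scheme are hypothetical, not the paper's exact construction).
# On in-distribution inputs it uses a rule that depends on a sensitive feature;
# on inputs that look like explainer perturbations it switches to an innocuous
# rule, so perturbation-based attributions conceal the bias.

import numpy as np
from sklearn.ensemble import RandomForestClassifier


class AdversarialModel:
    def __init__(self, X_train):
        X_train = np.asarray(X_train, dtype=float)
        # Train a detector separating real data (label 0) from LIME/SHAP-style
        # Gaussian perturbations of that data (label 1).
        perturbed = X_train + np.random.normal(0.0, 1.0, X_train.shape)
        X_det = np.vstack([X_train, perturbed])
        y_det = np.concatenate([np.zeros(len(X_train)), np.ones(len(perturbed))])
        self.ood_detector = RandomForestClassifier(n_estimators=100).fit(X_det, y_det)

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        on_manifold = self.ood_detector.predict(X) == 0
        return np.where(
            on_manifold,
            self._biased_rule(X),      # real inputs: depends on column 0 (the "sensitive" feature)
            self._innocuous_rule(X),   # explainer perturbations: ignores the sensitive feature
        ).astype(int)

    @staticmethod
    def _biased_rule(X):
        return (X[:, 0] > 0).astype(int)

    @staticmethod
    def _innocuous_rule(X):
        return (X[:, 1] > 0).astype(int)
\end{verbatim}

Run against an explainer that samples perturbations around an input (as LIME and SHAP's kernel explainer do), a model of this form would tend to attribute its predictions to the innocuous feature, even though its behavior on deployed, in-distribution inputs depends on the sensitive one.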