Feature-based explanations, which attribute to each feature its importance toward the model prediction, are arguably one of the most intuitive ways to explain a model. In this paper, we establish a novel set of evaluation criteria for such feature-based explanations through robustness analysis. In contrast to existing evaluations, which require specifying some way to "remove" features and thereby inevitably introduce biases and artifacts, we make use of the subtler notion of small adversarial perturbations. By optimizing towards our proposed evaluation criteria, we obtain new explanations that are loosely necessary and sufficient for a prediction. We further extend the explanations to extract the set of features that would move the current prediction to a target class, by adopting targeted adversarial attacks in the robustness analysis. Through experiments across multiple domains and a user study, we validate the usefulness of our evaluation criteria and the explanations derived from them.
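To make the robustness intuition concrete, the following is a minimal illustrative sketch, not the paper's actual method: for a linear model, the smallest L2 perturbation restricted to a feature subset that flips the prediction has a closed form, so we can compare how "robust" the prediction is when the adversary may touch only the explained features versus only the remaining ones. The model, the `min_flip_norm` helper, and all numbers here are hypothetical.

```python
import numpy as np

def min_flip_norm(w, x, subset):
    """Smallest L2 norm of a perturbation supported on `subset` that
    flips the sign of a linear score f(x) = w @ x.
    Closed form: |w @ x| / ||w restricted to subset||_2.
    Hypothetical helper for illustration only."""
    w_s = np.where(subset, w, 0.0)
    denom = np.linalg.norm(w_s)
    return np.inf if denom == 0.0 else abs(w @ x) / denom

# A good explanation should be such that perturbing only the
# *non-explained* features requires a large perturbation to change the
# prediction (sufficiency), while perturbing the explained features
# flips it easily (necessity).
w = np.array([3.0, 0.1, -0.2, 2.5])   # toy linear model weights
x = np.array([1.0, 1.0, 1.0, 1.0])    # toy input, score w @ x = 5.4 > 0

explained = np.array([True, False, False, True])  # top-2 features by |w|
rest = ~explained

eps_rest = min_flip_norm(w, x, rest)       # attack only non-explained features
eps_expl = min_flip_norm(w, x, explained)  # attack only explained features

print(eps_rest, eps_expl)  # large vs. small: the explanation looks good
```

Under this toy criterion, a larger `eps_rest` relative to `eps_expl` indicates a better feature set; the paper's actual criteria replace the closed form with adversarial attacks on general (non-linear) models.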