Many methods have been developed to understand complex predictive models, and high expectations are placed on post-hoc model explainability. It turns out that such explanations are neither robust nor trustworthy: they can be fooled. This paper presents techniques for attacking Partial Dependence (plots, profiles, PDP), one of the most popular methods for explaining any predictive model trained on tabular data. We show that PD can be manipulated in an adversarial manner, which is alarming, especially in financial or medical applications where auditability has become a must-have trait supporting black-box models. The fooling is performed by poisoning the data to bend and shift explanations in the desired direction using genetic and gradient algorithms. To the best of our knowledge, this is the first work performing attacks on variable dependence explanations. The novel approach of using a genetic algorithm for this purpose is highly transferable, as it generalizes both ways: in a model-agnostic and an explanation-agnostic manner.
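For context, a Partial Dependence profile averages a model's predictions over the data while one explained feature is fixed at each value of a grid. The following is a minimal, self-contained sketch of this computation (the toy model and function names are illustrative assumptions, not the paper's implementation); note that because PD averages over the dataset, poisoning that dataset directly alters the resulting profile, which is the attack surface the paper exploits.

```python
import numpy as np

def partial_dependence(model_predict, X, feature, grid):
    """PD profile: mean prediction with `feature` fixed at each grid value."""
    pd_values = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature] = v          # fix the explained feature for all rows
        pd_values.append(model_predict(X_mod).mean())
    return np.array(pd_values)

# Toy model whose prediction depends quadratically on feature 0
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
predict = lambda X: X[:, 0] ** 2 + X[:, 1]

grid = np.linspace(-2.0, 2.0, 5)
profile = partial_dependence(predict, X, feature=0, grid=grid)
# For this toy model the profile equals grid**2 + X[:, 1].mean(),
# i.e. it recovers the quadratic dependence up to a constant shift.
```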