Monumental advancements in artificial intelligence (AI) have attracted the interest of doctors, lenders, judges, and other professionals. While these high-stakes decision-makers are optimistic about the technology, those familiar with AI systems are wary of the lack of transparency in their decision-making processes. Perturbation-based post hoc explainers offer a model-agnostic means of interpreting these systems while requiring only query-level access. However, recent work demonstrates that these explainers can be fooled adversarially. This discovery has adverse implications for auditors, regulators, and other sentinels. With this in mind, several natural questions arise: how can we audit these black-box systems, and how can we ascertain that the auditee is complying with the audit in good faith? In this work, we rigorously formalize this problem and devise a defense against adversarial attacks on perturbation-based explainers. We propose algorithms for the detection (CAD-Detect) and defense (CAD-Defend) of these attacks, which are aided by our novel conditional anomaly detection approach, KNN-CAD. We demonstrate that our approach successfully detects whether a black-box system adversarially conceals its decision-making process and mitigates the adversarial attack on real-world data for the prevalent explainers, LIME and SHAP.
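To make the underlying idea concrete, the sketch below illustrates the kind of density-based signal at play: perturbation-based explainers such as LIME and SHAP probe the model with synthetic points that often lie off the data manifold, and a k-nearest-neighbor distance score can separate such probes from real inputs. This is only a minimal illustration under assumed data and threshold choices; the `KNNAnomalyScorer` class and its parameters are hypothetical names for exposition and are not the paper's KNN-CAD, CAD-Detect, or CAD-Defend algorithms.

```python
# Illustrative sketch (not the paper's KNN-CAD): a kNN-distance anomaly score
# that flags queries lying far from the reference data manifold, the kind of
# signal an adversarial model could exploit to behave differently on explainer
# perturbations than on genuine inputs.
import numpy as np
from sklearn.neighbors import NearestNeighbors


class KNNAnomalyScorer:
    def __init__(self, k=5, quantile=0.95):
        self.k = k
        self.quantile = quantile

    def fit(self, X_reference):
        # Index the reference (in-distribution) data and calibrate a threshold
        # from the reference points' own kNN distances.
        self.nn_ = NearestNeighbors(n_neighbors=self.k + 1).fit(X_reference)
        dists, _ = self.nn_.kneighbors(X_reference)
        scores = dists[:, 1:].mean(axis=1)  # skip the zero self-distance
        self.threshold_ = np.quantile(scores, self.quantile)
        return self

    def score(self, X_query):
        # Mean distance to the k nearest reference points; higher = more anomalous.
        dists, _ = self.nn_.kneighbors(X_query, n_neighbors=self.k)
        return dists.mean(axis=1)

    def is_anomalous(self, X_query):
        return self.score(X_query) > self.threshold_


if __name__ == "__main__":
    # Toy usage: fit on "real" data, then score off-manifold probes of the kind
    # a perturbation-based explainer might generate.
    rng = np.random.default_rng(0)
    X_real = rng.normal(size=(500, 10))
    detector = KNNAnomalyScorer(k=5).fit(X_real)
    X_probes = X_real[:20] + rng.normal(scale=3.0, size=(20, 10))
    print(detector.is_anomalous(X_probes).mean())  # most probes are flagged
```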