Monumental advancements in artificial intelligence (AI) have attracted the interest of doctors, lenders, judges, and other professionals. While these high-stakes decision-makers are optimistic about the technology, those familiar with AI systems are wary of the lack of transparency in their decision-making processes. Perturbation-based post hoc explainers offer a model-agnostic means of interpreting these systems while requiring only query-level access. However, recent work demonstrates that these explainers can be fooled adversarially. This discovery has adverse implications for auditors, regulators, and other sentinels. With this in mind, several natural questions arise: how can we audit these black-box systems, and how can we ascertain that the auditee is complying with the audit in good faith? In this work, we rigorously formalize this problem and devise a defense against adversarial attacks on perturbation-based explainers. We propose algorithms for the detection (CAD-Detect) and defense (CAD-Defend) of these attacks, which are aided by our novel conditional anomaly detection approach, KNN-CAD. We demonstrate that our approach successfully detects whether a black-box system adversarially conceals its decision-making process and mitigates such attacks on real-world data for the prevalent explainers, LIME and SHAP.
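To make the auditing setting concrete, the following is a minimal, illustrative sketch of the underlying idea, not the paper's CAD-Detect or KNN-CAD implementation. It assumes a hypothetical `black_box` that returns class probabilities under query-level access, a background sample `X_ref` of real data, and LIME-style Gaussian perturbations; a plain k-NN distance score stands in for the conditional anomaly detector. The intuition is that a scaffolding attack answers off-manifold explainer queries with an innocuous surrogate, so its behavior on anomalous queries diverges from its behavior on in-distribution queries.

```python
# Illustrative sketch only; names, parameters, and thresholds are assumptions.
import numpy as np
from sklearn.neighbors import NearestNeighbors


def knn_anomaly_scores(X_ref, X_query, k=5):
    """Mean distance to the k nearest real points; larger = more off-manifold."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_ref)
    dists, _ = nn.kneighbors(X_query)
    return dists.mean(axis=1)


def lime_style_perturbations(x, n=500, scale=0.3, seed=0):
    """Gaussian perturbations around x, as used by tabular LIME-style explainers."""
    rng = np.random.default_rng(seed)
    return x + scale * rng.standard_normal((n, x.shape[0]))


def scaffold_suspicion_score(black_box, x, X_ref, k=5, quantile=0.9):
    """Gap between the black box's average outputs on in-distribution vs.
    off-manifold perturbation queries; a large gap suggests the model treats
    explainer queries differently, i.e., possible adversarial concealment."""
    Z = lime_style_perturbations(x)
    scores = knn_anomaly_scores(X_ref, Z, k=k)
    cutoff = np.quantile(scores, quantile)
    p_on = black_box(Z[scores <= cutoff]).mean(axis=0)
    p_off = black_box(Z[scores > cutoff]).mean(axis=0)
    return float(np.abs(p_on - p_off).sum())
```

In use, an auditor would calibrate this gap on models known to be benign and flag a black box for closer inspection when its score exceeds that baseline; the paper's approach replaces the plain k-NN score with conditional anomaly detection and adds a defense that corrects the explanations themselves.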