Artificial intelligence (AI) comes with great opportunities but can also pose significant risks. Automatically generated explanations for decisions can increase transparency and foster trust, especially for systems based on automated predictions by AI models. However, given incentives, e.g., economic ones, to create dishonest AI, to what extent can we trust explanations? To address this issue, our work investigates how AI models, i.e., deep learning, and existing instruments for increasing transparency regarding AI decisions can be used to create and detect deceptive explanations. As an empirical evaluation, we focus on text classification and alter the explanations generated by GradCAM, a well-established explanation technique for neural networks. We then evaluate the effect of deceptive explanations on users in an experiment with 200 participants. Our findings confirm that deceptive explanations can indeed fool humans. However, one can deploy machine learning (ML) methods to detect seemingly minor deception attempts with accuracy exceeding 80% given sufficient domain knowledge. Without domain knowledge, one can still infer inconsistencies in the explanations in an unsupervised manner, given basic knowledge of the predictive model under scrutiny.
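To illustrate the kind of attribution that is generated and then manipulated, the following is a minimal sketch of Grad-CAM applied to a toy 1D-CNN text classifier in PyTorch. The architecture, vocabulary size, and layer choices here are illustrative assumptions, not the classifiers or data used in our experiments; the sketch only shows how per-token relevance scores of the form ReLU(sum_k alpha_k * A_k) can be obtained from a convolutional layer's feature maps and their gradients.

```python
# Minimal Grad-CAM sketch for a toy 1D-CNN text classifier (PyTorch).
# Model, vocabulary, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=50, n_filters=32, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size=3, padding=1)
        self.fc = nn.Linear(n_filters, n_classes)

    def forward(self, token_ids):
        x = self.emb(token_ids).transpose(1, 2)   # (B, emb_dim, seq_len)
        self.feat = F.relu(self.conv(x))          # feature maps kept for Grad-CAM
        self.feat.retain_grad()                   # keep gradients of non-leaf tensor
        pooled = self.feat.max(dim=2).values      # global max pool over positions
        return self.fc(pooled)                    # class logits

def grad_cam(model, token_ids, target_class):
    """Per-token relevance: ReLU(sum_k alpha_k * A_k), alpha_k = mean_t dy_c/dA_{k,t}."""
    logits = model(token_ids)
    model.zero_grad()
    logits[0, target_class].backward()
    weights = model.feat.grad.mean(dim=2, keepdim=True)  # (B, n_filters, 1)
    cam = F.relu((weights * model.feat).sum(dim=1))      # (B, seq_len)
    return cam / (cam.max() + 1e-8)                      # normalize to [0, 1]

if __name__ == "__main__":
    torch.manual_seed(0)
    model = TextCNN()
    tokens = torch.randint(0, 1000, (1, 12))   # one toy 12-token input
    relevance = grad_cam(model, tokens, target_class=1)
    print(relevance)                           # relevance score per token position
```

Scores of this form are the explanations that a dishonest provider could rescale or redistribute across tokens before presenting them to users, which is the setting our creation and detection experiments address.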