Deep neural networks (DNNs) have greatly impacted numerous fields over the past decade. Yet despite exhibiting superb performance on many problems, their black-box nature still poses a significant challenge to explainability. Indeed, explainable artificial intelligence (XAI) is crucial in several fields, wherein the answer alone -- without the reasoning by which it was derived -- is of little value. This paper uncovers a troubling property of explanation methods for image-based DNNs: by making small visual changes to the input image -- changes that hardly influence the network's output -- we demonstrate how explanations may be arbitrarily manipulated through the use of evolution strategies. Our novel algorithm, AttaXAI, a model-agnostic, adversarial attack on XAI algorithms, requires access only to the output logits of a classifier and to the explanation map; these weak assumptions render our approach highly useful where real-world models and data are concerned. We evaluate our method on two benchmark datasets -- CIFAR100 and ImageNet -- using four different pretrained deep-learning models: VGG16-CIFAR100, VGG16-ImageNet, MobileNet-CIFAR100, and Inception-v3-ImageNet. We find that XAI methods can be manipulated without the use of gradients or other model internals. AttaXAI successfully manipulates an image, in a manner imperceptible to the human eye, such that the XAI method outputs a specific explanation map. To our knowledge, this is the first such method in a black-box setting, and we believe it has significant value wherever explainability is desired, required, or legally mandated.
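To make the black-box setting concrete, the following is a minimal, evolution-strategies-style sketch of the kind of search the abstract describes: an image is perturbed so that the model's explanation map drifts toward a chosen target while the output logits stay close to the clean prediction. The callables `predict_logits` and `explain`, and all hyperparameter values, are hypothetical placeholders; this is an illustrative sketch under those assumptions, not the actual AttaXAI implementation.

```python
import numpy as np

def es_attack(x, target_expl, predict_logits, explain, *,
              pop_size=50, sigma=0.05, lr=0.01, steps=200, alpha=1.0):
    """Search for a small perturbation of `x` that pushes the explanation
    map toward `target_expl` while keeping the classifier's logits close
    to those of the clean image. Purely query-based: only `predict_logits`
    and `explain` are called, never the model's gradients."""
    clean_logits = predict_logits(x)
    x_adv = x.astype(np.float64).copy()

    def fitness(candidate):
        # Lower is better: the explanation should match the target map,
        # while the logits stay close to the clean prediction.
        expl_loss = np.mean((explain(candidate) - target_expl) ** 2)
        logit_loss = np.mean((predict_logits(candidate) - clean_logits) ** 2)
        return expl_loss + alpha * logit_loss

    for _ in range(steps):
        # Sample a population of random search directions.
        noise = np.random.randn(pop_size, *x_adv.shape)
        scores = np.array([fitness(np.clip(x_adv + sigma * n, 0.0, 1.0))
                           for n in noise])
        # Rank-normalise fitness values and form a search-gradient estimate
        # (natural-evolution-strategies style); the 1/sigma factor is folded
        # into the learning rate.
        z = (scores - scores.mean()) / (scores.std() + 1e-8)
        grad_est = (z.reshape(pop_size, *([1] * x_adv.ndim)) * noise).mean(axis=0)
        # Step against the estimated gradient to minimise the fitness,
        # keeping pixel values in [0, 1].
        x_adv = np.clip(x_adv - lr * grad_est, 0.0, 1.0)
    return x_adv
```

In such a setup the query budget is roughly `pop_size * steps * 2` model/explainer calls, and the explanation distance would be swapped for whatever measure suits the chosen XAI method; both choices are illustrative here.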