Interpreting neural network classifiers with gradient-based saliency maps has been studied extensively in the deep learning literature. While existing algorithms perform satisfactorily on standard image recognition datasets, recent work demonstrates that widely used gradient-based interpretation schemes are vulnerable to norm-bounded perturbations designed adversarially for each individual input sample. However, such adversarial perturbations rely on knowledge of the input sample and therefore perform sub-optimally on unknown or constantly changing data points. In this paper, we show the existence of a Universal Perturbation for Interpretation (UPI) for standard image datasets: a single norm-bounded perturbation that alters the gradient-based feature maps of a neural network over a significant fraction of test samples. To design such a UPI, we propose both a gradient-based optimization method and a principal component analysis (PCA)-based approach, each of which effectively alters a neural network's gradient-based interpretation across different samples. We support the proposed UPI approaches with numerical results demonstrating their successful application to standard image datasets.
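To make the optimization concrete, the following is a minimal PyTorch sketch of one plausible formulation of the gradient-based UPI computation: projected gradient descent on a single perturbation `v` that minimizes the cosine similarity between clean and perturbed saliency maps across batches. All names (`model`, `loader`), the input shape, and the hyperparameters are illustrative assumptions, not the paper's reference implementation. Note also that for plain-ReLU networks the saliency map is piecewise constant in the input, so second-order attacks of this kind are typically run on a smoothed (e.g., softplus) surrogate of the network.

```python
import torch
import torch.nn.functional as F

def saliency(model, x):
    """Simple-gradient saliency: gradient of the top-class logit w.r.t. the input.

    create_graph=True keeps the graph so the saliency itself can be differentiated
    (needed to optimize a perturbation against the interpretation).
    """
    logits = model(x)
    score = logits.max(dim=1).values.sum()
    (g,) = torch.autograd.grad(score, x, create_graph=True)
    return g

def upi_pgd(model, loader, eps=8 / 255, lr=1e-2, steps=100):
    """Sketch of a universal perturbation v (shared across samples) found by
    projected gradient descent on the saliency-dissimilarity objective."""
    v = torch.zeros(1, 3, 32, 32, requires_grad=True)  # assumed CIFAR-10-shaped inputs
    opt = torch.optim.Adam([v], lr=lr)
    for _, (x, _) in zip(range(steps), loader):
        x = x.requires_grad_(True)
        s_clean = saliency(model, x).detach()          # reference saliency maps
        s_adv = saliency(model, (x + v).clamp(0, 1))   # saliency under the UPI
        # Disrupt the interpretation: minimize cosine similarity of the maps.
        loss = F.cosine_similarity(s_adv.flatten(1), s_clean.flatten(1), dim=1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            v.clamp_(-eps, eps)  # project back onto the l_inf ball of radius eps
    return v.detach()
```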
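The PCA-based variant admits a similarly hedged sketch: collect one attack direction per sample (here, the first-step gradient of the same saliency-dissimilarity loss) and take the top principal component of these directions as the universal perturbation. The aggregation rule below is our assumption about how PCA enters the method; the sketch reuses `saliency()` from the block above.

```python
import torch
import torch.nn.functional as F

def upi_pca(model, loader, eps=8 / 255, n_batches=50):
    """Sketch: PCA over per-sample saliency-attack directions yields a UPI."""
    dirs = []
    for _, (x, _) in zip(range(n_batches), loader):
        x = x.requires_grad_(True)
        s_clean = saliency(model, x).detach()
        delta = torch.zeros_like(x, requires_grad=True)
        s_adv = saliency(model, (x + delta).clamp(0, 1))
        loss = F.cosine_similarity(s_adv.flatten(1), s_clean.flatten(1), dim=1).sum()
        (g,) = torch.autograd.grad(loss, delta)   # per-sample attack directions
        dirs.append(g.flatten(1).detach())
    D = torch.cat(dirs)                           # rows: one direction per sample
    # Top right-singular vector = first principal direction of the gradients.
    _, _, Vh = torch.linalg.svd(D, full_matrices=False)
    v = Vh[0].reshape(x.shape[1:])
    # Scale into the eps-ball; the sign of a principal direction is arbitrary,
    # so in practice both +v and -v would be evaluated.
    return (eps * v / v.abs().max()).unsqueeze(0)
```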