Interpretability is crucial for understanding the inner workings of deep neural networks (DNNs), and many interpretation methods generate saliency maps that highlight the parts of the input image that contribute most to the prediction made by the DNN. In this paper we design a backdoor attack that alters the saliency map the network produces for an input image only when a trigger, invisible to the naked eye, is injected into that image, while maintaining prediction accuracy. The attack relies on injecting poisoned data containing the trigger into the training data set. The saliency maps are incorporated into a penalty term of the objective function used to train the deep model, and their influence on model training is conditioned on the presence of the trigger. We design two types of attacks: a targeted attack that enforces a specific modification of the saliency map, and an untargeted attack in which the importance scores of the top pixels of the original saliency map are significantly reduced. We empirically evaluate the proposed backdoor attacks on gradient-based and gradient-free interpretation methods across a variety of deep learning architectures. We show that our attacks constitute a serious security threat when deploying deep learning models developed by untrusted sources. Finally, in the Supplement we demonstrate that the proposed methodology can be used in an inverted setting, where the correct saliency map is obtained only in the presence of a trigger (key), effectively making the interpretation system available only to selected users.
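To make the trigger-conditioned penalty concrete, the following is a minimal sketch (not the authors' released code) of a training objective of this kind, assuming a PyTorch classifier and a simple gradient-based saliency (absolute input gradient). The names `poisoned_loss`, `trigger_mask`, `target_map`, and the weight `lam` are illustrative placeholders, not identifiers from the paper.

```python
# Minimal sketch of a trigger-conditioned saliency penalty (illustrative only).
import torch
import torch.nn.functional as F

def input_gradient_saliency(model, x, y):
    """Saliency map as the absolute input gradient of the true-class score."""
    x = x.clone().requires_grad_(True)
    score = model(x).gather(1, y.view(-1, 1)).sum()
    grad, = torch.autograd.grad(score, x, create_graph=True)
    return grad.abs().mean(dim=1)  # collapse channels -> (B, H, W)

def poisoned_loss(model, x, y, trigger_mask, target_map, lam=1.0):
    """Cross-entropy on all samples; saliency penalty only where the trigger is present.

    trigger_mask: boolean tensor (B,) marking poisoned samples carrying the trigger.
    target_map:   attacker-chosen saliency map (H, W) for the targeted variant.
    """
    ce = F.cross_entropy(model(x), y)
    if trigger_mask.any():
        sal = input_gradient_saliency(model, x[trigger_mask], y[trigger_mask])
        # Targeted variant: push the saliency of triggered inputs toward target_map.
        penalty = F.mse_loss(sal, target_map.expand_as(sal))
        return ce + lam * penalty
    return ce
```

Because the penalty is applied only on samples whose inputs carry the trigger, clean inputs are trained with the ordinary cross-entropy and keep their usual saliency maps, which matches the behavior described in the abstract.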