Deep neural networks (DNNs) are known to be vulnerable to adversarial examples crafted with imperceptible perturbations, i.e., a small change in an input image can induce a misclassification, which threatens the reliability of deep learning-based systems deployed in the real world. Adversarial training (AT) is often adopted to improve robustness by training on a mixture of adversarially perturbed and clean data. However, most AT-based methods are ineffective against transferred adversarial examples, which are generated to fool a wide spectrum of defense models, and thus cannot satisfy the generalization requirements of real-world scenarios. Moreover, adversarially training a defense model generally does not yield interpretable predictions for perturbed inputs, whereas domain experts require a highly interpretable robust model to understand the behaviour of a DNN. In this work, we propose a novel approach based on the Jacobian norm and Selective Input Gradient Regularization (J-SIGR), which promotes linearized robustness through Jacobian normalization and regularizes perturbation-based saliency maps so that the model produces interpretable predictions. In this way, we achieve both improved defense and high interpretability of DNNs. Finally, we evaluate our method across different architectures against powerful adversarial attacks. Experiments demonstrate that the proposed J-SIGR confers improved robustness against transferred adversarial attacks, and we also show that the predictions of the neural network are easy to interpret.
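The abstract combines two regularizers: a Jacobian-norm penalty that encourages linearized robustness, and a selective penalty on input gradients that keeps perturbation-based saliency maps sharp. The following is a minimal PyTorch-style sketch of such a combined objective, assuming a standard image-classification setup; the helper name `j_sigr_loss`, the random-projection estimate of the Jacobian norm, and the top-q selection rule are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def j_sigr_loss(model, x, y, lambda_jac=0.01, lambda_sigr=0.01, top_q=0.2):
    """Sketch of a J-SIGR-style training objective (hypothetical helper).

    Combines:
      * standard cross-entropy,
      * a Jacobian-norm penalty, estimated with a single random projection
        (||J||_F^2 is approximated by E_v ||J^T v||_2^2 for random unit v),
      * selective input-gradient regularization: gradients are penalized
        everywhere except the top-q most salient pixels, so the surviving
        saliency map stays concentrated and easier to interpret.
    """
    x = x.clone().requires_grad_(True)
    logits = model(x)
    ce = F.cross_entropy(logits, y)

    # Jacobian-norm estimate via a vector-Jacobian product with a random direction.
    v = torch.randn_like(logits)
    v = v / (v.norm(dim=1, keepdim=True) + 1e-12)
    jvp = torch.autograd.grad((logits * v).sum(), x, create_graph=True)[0]
    jac_penalty = jvp.pow(2).sum(dim=tuple(range(1, x.dim()))).mean()

    # Input gradient of the loss w.r.t. the image (perturbation-based saliency).
    input_grad = torch.autograd.grad(ce, x, create_graph=True)[0]
    sal = input_grad.abs().flatten(1)
    k = max(1, int(top_q * sal.size(1)))
    thresh = sal.topk(k, dim=1).values[:, -1:].detach()
    mask = (sal < thresh).float()          # 1 on non-salient pixels only
    sigr_penalty = (sal * mask).pow(2).sum(dim=1).mean()

    return ce + lambda_jac * jac_penalty + lambda_sigr * sigr_penalty
```

In a training loop this loss would simply replace the plain cross-entropy term; the two weights `lambda_jac` and `lambda_sigr` trade off clean accuracy against robustness and saliency sparsity, and would need to be tuned per architecture and dataset.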