Deep neural networks (DNNs) are known to be vulnerable to adversarial examples crafted with imperceptible perturbations, i.e., a small change to an input image can induce a misclassification, which threatens the reliability of deep-learning-based systems. Adversarial training (AT) is often adopted to improve the robustness of DNNs by training on a mixture of corrupted and clean data. However, most AT-based methods are ineffective against \textit{transferred adversarial examples}, which are generated to fool a wide spectrum of defense models, and thus cannot satisfy the generalization requirement of real-world scenarios. Moreover, an adversarially trained defense model in general cannot produce interpretable predictions on perturbed inputs, whereas domain experts require a highly interpretable robust model to understand the behaviour of a DNN. In this work, we propose an approach based on the Jacobian norm and Selective Input Gradient Regularization (J-SIGR), which promotes linearized robustness through Jacobian normalization and also regularizes perturbation-based saliency maps so that the model's predictions are interpretable. As such, we achieve both improved defense and high interpretability of DNNs. Finally, we evaluate our method across different architectures against powerful adversarial attacks. Experiments demonstrate that the proposed J-SIGR confers improved robustness against transferred adversarial attacks, and we also show that the predictions of the neural network are easy to interpret.
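To make the input-gradient-regularization idea concrete, the sketch below adds a saliency penalty, the squared norm of the loss gradient with respect to the input, to an ordinary training loss. This is a minimal illustration under simplifying assumptions, not the paper's J-SIGR implementation: the model is plain logistic regression (so the input gradient has a closed form) and the weight `lam` is a hypothetical hyperparameter.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_with_input_grad_penalty(w, x, y, lam=0.1):
    """Cross-entropy loss plus an input-gradient (saliency) penalty.

    For logistic regression f(x) = sigmoid(w . x), the gradient of the
    cross-entropy loss with respect to the input x is (f(x) - y) * w,
    so the saliency penalty can be computed in closed form.
    `lam` is an illustrative regularization weight (an assumption,
    not a value from the paper).
    """
    p = sigmoid(w @ x)
    ce = -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    input_grad = (p - y) * w            # d(ce)/dx in closed form
    penalty = np.sum(input_grad ** 2)   # squared L2 norm of the saliency map
    return ce + lam * penalty
```

Penalizing this norm pushes the model toward locally linear, smoother decision boundaries, which is the intuition behind gradient-based regularization defenses; in a deep network the same term would be obtained with automatic differentiation (double backpropagation) rather than a closed form.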