Deep neural networks (DNNs) have had many successes, but they suffer from two major issues: (1) a vulnerability to adversarial examples and (2) a tendency to elude human interpretation. Interestingly, recent empirical and theoretical evidence suggests these two seemingly disparate issues are actually connected. In particular, robust models tend to provide more interpretable gradients than non-robust models. However, whether this relationship works in the opposite direction remains unclear. In this paper, we seek empirical answers to the following question: can models acquire adversarial robustness when they are trained to have interpretable gradients? We introduce a theoretically inspired technique called Interpretation Regularization (IR), which encourages a model's gradients to (1) match the direction of interpretable target saliency maps and (2) have small magnitude. To assess model performance and tease apart the factors that contribute to adversarial robustness, we conduct extensive experiments on MNIST and CIFAR-10 with both $\ell_2$ and $\ell_\infty$ attacks. We demonstrate that training the networks to have interpretable gradients improves their robustness to adversarial perturbations. Applying the network interpretation technique SmoothGrad yields additional performance gains, especially in cross-norm attacks and under heavy perturbations. The results indicate that the interpretability of the model gradients is a crucial factor for adversarial robustness. Code for the experiments can be found at https://github.com/a1noack/interp_regularization.
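The abstract describes a penalty with two terms: one aligning the input gradient with a target saliency map, and one shrinking the gradient's magnitude. The paper's exact loss is not reproduced here; the following is a minimal sketch of how such a regularizer could look, assuming cosine similarity for the direction term and a squared $\ell_2$ norm for the magnitude term (the function name, `lam` weight, and `eps` stabilizer are illustrative choices, not the authors' code).

```python
import numpy as np

def interpretation_regularizer(input_grad, target_map, lam=0.1, eps=1e-8):
    """Sketch of an IR-style penalty on a model's input gradient.

    direction term: 1 - cos(grad, target), zero when the gradient
        points exactly along the target saliency map;
    magnitude term: lam * ||grad||^2, pushing the gradient norm down.
    """
    g = np.asarray(input_grad, dtype=float).ravel()
    s = np.asarray(target_map, dtype=float).ravel()
    # Cosine similarity between gradient and target saliency map.
    cos = (g @ s) / (np.linalg.norm(g) * np.linalg.norm(s) + eps)
    direction_term = 1.0 - cos
    magnitude_term = lam * (g @ g)  # squared l2 norm of the gradient
    return direction_term + magnitude_term
```

In training, this penalty would be added to the task loss, with the input gradient obtained from the model by automatic differentiation and the target map supplied by an interpretation method such as SmoothGrad.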