The existence of adversarial attacks on convolutional neural networks (CNNs) calls into question the fitness of such models for serious applications. The attacks manipulate an input image such that a misclassification is evoked while the image still looks normal to a human observer -- they are thus not easily detectable. In a different context, backpropagated activations of CNN hidden layers -- "feature responses" to a given input -- have been used to visualize for a human "debugger" what the CNN "looks at" while computing its output. In this work, we propose a novel method for detecting adversarial examples in order to prevent such attacks. We do so by tracking adversarial perturbations in feature responses, allowing for automatic detection using average local spatial entropy. The method does not alter the original network architecture and is fully human-interpretable. Experiments confirm the validity of our approach for state-of-the-art attacks on large-scale models trained on ImageNet.
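Since the abstract names average local spatial entropy over feature responses as the detection statistic, the following minimal sketch illustrates one plausible way such a statistic could be computed from a 2-D feature response map. The window size, patch normalization, and threshold-based decision rule are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def average_local_spatial_entropy(response_map, window=8, eps=1e-12):
    """Average Shannon entropy over non-overlapping local windows of a
    2-D feature response map. The window size is an illustrative choice."""
    h, w = response_map.shape
    entropies = []
    for y in range(0, h - window + 1, window):
        for x in range(0, w - window + 1, window):
            patch = response_map[y:y + window, x:x + window].astype(np.float64)
            p = patch / (patch.sum() + eps)            # normalize patch to a distribution
            p = p[p > 0]                               # drop zero bins (0 * log 0 := 0)
            entropies.append(-(p * np.log2(p)).sum())  # Shannon entropy of the patch
    return float(np.mean(entropies)) if entropies else 0.0

def is_adversarial(response_map, threshold, flag_if_below=True):
    """Hypothetical thresholding detector: the threshold and the direction of
    the comparison would have to be calibrated on clean vs. attacked examples;
    neither is specified here."""
    s = average_local_spatial_entropy(response_map)
    return s < threshold if flag_if_below else s > threshold
```

This sketch assumes the feature response map (e.g., obtained via guided backpropagation or a similar attribution method) has already been reduced to a single non-negative 2-D array; how the map is produced and how the decision rule is calibrated are left open.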