The extreme fragility of deep neural networks when presented with tiny perturbations in their inputs was independently discovered by several research groups in 2013, but in spite of enormous effort these adversarial examples remained a baffling phenomenon with no clear explanation. In this paper we introduce a new conceptual framework (which we call the Dimpled Manifold Model) that provides a simple explanation for why adversarial examples exist, why their perturbations have such tiny norms, why these perturbations look like random noise, and why a network that was adversarially trained with incorrectly labeled images can still correctly classify test images. In the last part of the paper we describe the results of numerous experiments which strongly support this new model, and in particular our assertion that adversarial perturbations are roughly perpendicular to the low-dimensional manifold which contains all the training examples.
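To make the perpendicularity claim concrete, the following is a minimal sketch (not taken from the paper) of one way such an assertion could be tested: approximate the training-data manifold by the top-k principal components of the flattened training images, then measure what fraction of an adversarial perturbation's squared norm lies outside that subspace. The function name off_manifold_fraction, the choice of k, and the random stand-in data are illustrative assumptions, not the authors' experimental setup.

```python
import numpy as np

def off_manifold_fraction(train_images, delta, k=50):
    """Fraction of the perturbation's squared norm that is orthogonal to the
    top-k PCA subspace spanned by the (flattened) training images."""
    X = train_images.reshape(len(train_images), -1).astype(np.float64)
    X -= X.mean(axis=0)                      # center the training data
    # Top-k right singular vectors give a crude global basis for the manifold.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    basis = Vt[:k]                           # shape (k, d)
    d = delta.reshape(-1).astype(np.float64)
    tangent_part = basis.T @ (basis @ d)     # projection onto the subspace
    normal_part = d - tangent_part           # component off the subspace
    return np.dot(normal_part, normal_part) / np.dot(d, d)

# Random stand-in data for illustration only; with a real adversarial
# perturbation on a real dataset, the model's claim would predict a
# fraction close to 1 (i.e. the perturbation is mostly off-manifold).
rng = np.random.default_rng(0)
images = rng.random((1000, 28, 28))
delta = rng.normal(size=(28, 28))
print(off_manifold_fraction(images, delta, k=50))
```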