The reasons why Deep Neural Networks are susceptible to being fooled by adversarial examples remain an open discussion. Indeed, many different strategies can be employed to efficiently generate adversarial attacks, some of them relying on different theoretical justifications. Among these strategies, universal (input-agnostic) perturbations are of particular interest, due to their capability to fool a network independently of the input to which the perturbation is applied. In this work, we investigate an intriguing phenomenon of universal perturbations, which has been reported previously in the literature, yet without a proven justification: universal perturbations change the predicted classes for most inputs into one particular (dominant) class, even if this behavior is not specified during the creation of the perturbation. In order to justify the cause of this phenomenon, we propose a number of hypotheses and experimentally test them using a speech command classification problem in the audio domain as a testbed. Our analyses reveal interesting properties of universal perturbations, suggest new methods to generate such attacks, and provide an explanation of dominant classes from both a geometric and a data-feature perspective.
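To make the dominant-class phenomenon concrete, the sketch below shows one way it can be quantified: add the same fixed (universal) perturbation to every input in a batch and measure what fraction of predictions collapse into the single most frequent class. This is only an illustrative sketch, not the paper's method: TinyAudioNet and dominant_class_rate are hypothetical names, the model and waveforms are random stand-ins rather than a trained speech command classifier, and the perturbation here is random noise rather than an optimized universal attack.

```python
import torch
import torch.nn as nn

# Minimal stand-in classifier over 1-second, 16 kHz waveforms (NOT the paper's model).
class TinyAudioNet(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=160, stride=80), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(16, n_classes)

    def forward(self, x):
        return self.head(self.features(x).squeeze(-1))

def dominant_class_rate(model, inputs, delta):
    """Fraction of inputs pushed into the single most frequent predicted class
    after adding the same (input-agnostic) perturbation delta to every input."""
    with torch.no_grad():
        preds = model(inputs + delta).argmax(dim=1)
    counts = torch.bincount(preds, minlength=model.head.out_features)
    dominant = counts.argmax()
    return dominant.item(), counts[dominant].item() / len(preds)

if __name__ == "__main__":
    torch.manual_seed(0)
    model = TinyAudioNet().eval()
    x = torch.randn(256, 1, 16000)           # batch of waveforms (random stand-ins)
    delta = 0.05 * torch.randn(1, 1, 16000)  # one fixed perturbation shared by all inputs
    cls, rate = dominant_class_rate(model, x, delta)
    print(f"dominant class {cls}: {rate:.1%} of perturbed inputs")
```

With an actual trained classifier and an optimized universal perturbation, a high rate returned by such a measurement is exactly the behavior the abstract describes: most perturbed inputs end up assigned to one dominant class even though no target class was specified when the perturbation was created.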