Most works follow the definition of an adversarial example as an input whose perturbation is imperceptible to humans yet fools deep neural networks (DNNs). Some works identify other interesting forms of adversarial examples, such as inputs that are unrecognizable to humans but are classified by DNNs as a particular class with high confidence, and adversarial patches. Based on this phenomenon, in this paper we propose a new definition of adversarial examples from the perspective of human and machine cognition. We show that imperceptible adversarial examples, unrecognizable adversarial examples, and adversarial patches are all derivatives of generalized adversarial examples. We then propose three types of adversarial attacks based on the generalized definition. Finally, we propose a defence mechanism that achieves state-of-the-art performance. We construct a lossy compression function to filter out the redundant features generated by the network; in this process, the perturbation introduced by the attacker is filtered out as well, so the defence mechanism effectively improves the robustness of the model. Experiments show that our attack methods effectively generate adversarial examples, and that our defence method significantly improves the adversarial robustness of DNNs compared with adversarial training. To the best of our knowledge, our defence method achieves the best performance even though it does not employ adversarial training.
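To make the idea of the lossy-compression defence concrete, here is a minimal sketch, not the paper's actual implementation: a low-dimensional bottleneck inserted between a feature extractor and the classifier head compresses the network's redundant features, and the information discarded by this lossy projection is assumed to carry most of the adversarial perturbation. All names (FeatureCompressionDefense, bottleneck_dim) and the linear-bottleneck design are illustrative assumptions.

```python
# Sketch of a lossy feature-compression defence (illustrative only).
import torch
import torch.nn as nn

class FeatureCompressionDefense(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int,
                 bottleneck_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                                # pretrained feature extractor
        self.compress = nn.Linear(feat_dim, bottleneck_dim)     # lossy projection (compression)
        self.decompress = nn.Linear(bottleneck_dim, feat_dim)   # map back to feature space
        self.classifier = nn.Linear(feat_dim, num_classes)      # classification head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(x)                       # redundant high-dimensional features
        compressed = torch.relu(self.compress(feats))  # keep only a few dominant components
        filtered = self.decompress(compressed)         # perturbation is (partly) filtered out
        return self.classifier(filtered)

# Usage sketch: wrap any feature extractor, e.g. a ResNet with its final fc layer removed.
# model = FeatureCompressionDefense(resnet_features, feat_dim=512,
#                                   bottleneck_dim=64, num_classes=10)
```

The design choice illustrated here is that the bottleneck dimension trades off clean accuracy against robustness: a smaller bottleneck discards more of the perturbation but also more legitimate feature information.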