Deep neural networks (DNNs) are vulnerable to adversarial noise. Their adversarial robustness can be improved by exploiting adversarial examples. However, given continuously evolving attacks, models trained on seen types of adversarial examples generally do not generalize well to unseen types. To address this problem, in this paper we propose to remove adversarial noise by learning attack-invariant features that generalize across attacks while preserving semantic classification information. Specifically, we introduce an adversarial feature learning mechanism to disentangle invariant features from adversarial noise. In addition, we propose a normalization term in the encoded space of the attack-invariant features to mitigate the bias between seen and unseen types of attacks. Empirical evaluations demonstrate that our method provides better protection than previous state-of-the-art approaches, especially against unseen types of attacks and adaptive attacks.
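To make the two components named above concrete, the following is a minimal sketch, not the authors' implementation: it assumes a PyTorch encoder that produces attack-invariant features, a discriminator used for adversarial feature learning (the encoder tries to make adversarial encodings indistinguishable from natural ones), and an illustrative unit-norm penalty standing in for the normalization term on the encoded space. All module names, the loss weighting, and the choice of normalization are assumptions for illustration only.

```python
# Hypothetical sketch of attack-invariant feature learning (not the paper's official code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    """Maps an input image to an attack-invariant feature vector."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, x):
        return self.net(x)


class Discriminator(nn.Module):
    """Predicts whether an encoded feature came from a natural or an adversarial input."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, z):
        return self.net(z)


def encoder_loss(enc, disc, classifier, x_nat, x_adv, y, norm_weight=0.1):
    """Encoder objective (illustrative): keep class semantics, fool the discriminator,
    and normalize the encoded space so seen and unseen attacks are treated alike."""
    z_nat, z_adv = enc(x_nat), enc(x_adv)
    # 1) Semantic consistency: both encodings should still classify correctly.
    cls_loss = F.cross_entropy(classifier(z_nat), y) + F.cross_entropy(classifier(z_adv), y)
    # 2) Adversarial feature learning: push adversarial encodings toward the
    #    "natural" label (0) so the discriminator cannot tell them apart.
    adv_loss = F.binary_cross_entropy_with_logits(
        disc(z_adv), torch.zeros(z_adv.size(0), 1, device=z_adv.device))
    # 3) Normalization term on the encoded space (unit-norm penalty, assumed form).
    norm_loss = ((z_nat.norm(dim=1) - 1) ** 2).mean() + ((z_adv.norm(dim=1) - 1) ** 2).mean()
    return cls_loss + adv_loss + norm_weight * norm_loss


# Usage example with stand-in data (the perturbation below is random noise,
# not a real attack, purely to exercise the loss).
enc, disc = Encoder(), Discriminator()
classifier = nn.Linear(128, 10)                  # semantic classification head (assumed)
x_nat = torch.randn(8, 3, 32, 32)                # natural images
x_adv = x_nat + 0.03 * torch.randn_like(x_nat)   # placeholder for adversarial examples
y = torch.randint(0, 10, (8,))
loss = encoder_loss(enc, disc, classifier, x_nat, x_adv, y)
loss.backward()
```

In an actual training loop, the discriminator would be updated with the opposite objective (distinguishing natural from adversarial encodings), and x_adv would come from a real attack rather than random noise; both details are omitted here for brevity.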