Deep Neural Networks (DNNs) are vulnerable to adversarial attacks. As a countermeasure, adversarial training aims to achieve robustness by solving a min-max optimization problem, and it has been shown to be one of the most effective defense strategies. However, in this work, we find that compared with natural training, adversarial training fails to learn better feature representations for either clean or adversarial samples, which can be one reason why adversarial training tends to suffer from severe overfitting and unsatisfactory generalization performance. Specifically, we observe two major shortcomings of the features learned by existing adversarial training methods: (1) low intra-class feature similarity; and (2) conservative inter-class feature variance. To overcome these shortcomings, we introduce a new concept, the adversarial training graph (ATG), with which the proposed adversarial training with feature separability (ATFS) coherently boosts intra-class feature similarity and increases inter-class feature variance. Through comprehensive experiments, we demonstrate that the proposed ATFS framework significantly improves both clean and robust performance.
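For reference, the min-max optimization problem that adversarial training solves is commonly written as follows; the symbols ($f_{\theta}$ for the network, $\mathcal{L}$ for the loss, $\epsilon$ for the perturbation budget, $\mathcal{D}$ for the data distribution) are generic notation, not definitions taken from this paper:

\[
\min_{\theta}\; \mathbb{E}_{(x,y)\sim\mathcal{D}}
\Bigl[\, \max_{\|\delta\|_{p}\le\epsilon} \mathcal{L}\bigl(f_{\theta}(x+\delta),\, y\bigr) \Bigr]
\]

The inner maximization crafts a worst-case perturbation $\delta$ within an $\ell_p$ ball of radius $\epsilon$, and the outer minimization trains the parameters $\theta$ against that worst case.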
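As a concrete illustration of the two feature-space quantities the abstract names, the sketch below computes them from a batch of penultimate-layer features. The function names and the specific definitions (mean pairwise cosine similarity within a class; variance of class-mean features around their global mean) are our own illustrative choices, not the paper's metrics:

```python
import numpy as np

def intra_class_similarity(features: np.ndarray, labels: np.ndarray) -> float:
    """Mean pairwise cosine similarity among features sharing a label.

    Illustrative metric only; the paper may define this quantity differently.
    features: (N, D) array of feature vectors; labels: (N,) integer labels.
    """
    # L2-normalize so that dot products equal cosine similarities.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = []
    for c in np.unique(labels):
        fc = f[labels == c]
        n = len(fc)
        if n < 2:
            continue
        gram = fc @ fc.T  # pairwise cosine similarities, diagonal is 1
        # Average over the n*(n-1) off-diagonal entries (exclude self-similarity).
        sims.append((gram.sum() - n) / (n * (n - 1)))
    return float(np.mean(sims))

def inter_class_variance(features: np.ndarray, labels: np.ndarray) -> float:
    """Variance of class-mean features around the global mean (illustrative)."""
    means = np.stack([features[labels == c].mean(axis=0)
                      for c in np.unique(labels)])
    return float(((means - means.mean(axis=0)) ** 2).sum(axis=1).mean())
```

Under these definitions, the abstract's two shortcomings would surface as a low value of `intra_class_similarity` (features of the same class are spread apart) and a small value of `inter_class_variance` (class means cluster too closely together).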