Adversarial training is one of the most effective approaches to improving model robustness against adversarial examples. However, previous works mainly focus on the overall robustness of the model, and an in-depth analysis of the role each class plays in adversarial training is still missing. In this paper, we propose to analyze the class-wise robustness in adversarial training. First, we provide a detailed diagnosis of adversarial training on six benchmark datasets, i.e., MNIST, CIFAR-10, CIFAR-100, SVHN, STL-10, and ImageNet. Surprisingly, we find remarkable robustness discrepancies among classes, leading to unbalanced/unfair class-wise robustness in the robust models. Furthermore, investigating the relations between classes, we find that the unbalanced class-wise robustness is fairly consistent across different attack and defense methods. Moreover, we observe that stronger attack methods in adversarial learning achieve their performance improvement mainly through more successful attacks on the vulnerable classes (i.e., the classes with less robustness). Inspired by these findings, we design a simple but effective attack method based on the traditional PGD attack, named the Temperature-PGD attack, which enlarges the robustness disparity among classes by applying a temperature factor to the confidence distribution of each image. Experiments demonstrate that our method achieves a higher attack success rate than the PGD attack. Furthermore, from the defense perspective, we make modifications to the training and inference phases to improve the robustness of the most vulnerable class, thereby mitigating the large difference in class-wise robustness. We believe our work contributes to a more comprehensive understanding of adversarial training as well as a rethinking of class-wise properties in robust models.
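To make the idea concrete, the following is a minimal, hypothetical sketch of how a temperature factor could be folded into a PGD-style attack. It uses a toy linear model and numpy; the model, the hyper-parameter values, and the exact placement of the temperature (dividing the logits before the cross-entropy loss) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def temperature_pgd(x, y, W, b, eps=0.3, alpha=0.05, steps=10, T=0.5):
    """Hypothetical Temperature-PGD sketch on a linear model logits = x @ W + b.

    Identical to standard PGD except the logits are divided by a
    temperature T before the softmax/cross-entropy, which sharpens
    (T < 1) the confidence distribution driving the gradient.
    All names and defaults here are illustrative assumptions.
    """
    x_adv = x.copy()
    num_classes = W.shape[1]
    onehot = np.eye(num_classes)[y]
    for _ in range(steps):
        logits = x_adv @ W + b
        p = softmax(logits / T)                  # temperature-scaled confidence
        # For a linear model, d(cross-entropy)/dx = (p - onehot) @ W.T / T
        grad = (p - onehot) @ W.T / T
        x_adv = x_adv + alpha * np.sign(grad)    # gradient-ascent step on the loss
        x_adv = np.clip(x_adv, x - eps, x + eps) # project back into the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)         # keep inputs in a valid range
    return x_adv
```

Setting T = 1 recovers ordinary PGD on this toy model; T < 1 sharpens the confidence distribution, which is one plausible way a temperature factor could steer the attack toward low-confidence (vulnerable) classes.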