Adversarial training is one of the most effective approaches for improving model robustness against adversarial examples. However, previous work mainly focuses on the overall robustness of the model, and an in-depth analysis of the role each class plays in adversarial training is still missing. In this paper, we analyze class-wise robustness in adversarial training. First, we provide a detailed diagnosis of adversarial training on six benchmark datasets, i.e., MNIST, CIFAR-10, CIFAR-100, SVHN, STL-10 and ImageNet. Surprisingly, we find remarkable robustness discrepancies among classes, leading to unbalanced/unfair class-wise robustness in the robust models. Furthermore, we investigate the relations between classes and find that the unbalanced class-wise robustness is quite consistent across different attack and defense methods. Moreover, we observe that stronger attack methods in adversarial learning gain their performance improvement mainly from more successful attacks on the vulnerable classes (i.e., classes with less robustness). Inspired by these findings, we design a simple but effective attack method based on the traditional PGD attack, named Temperature-PGD attack, which enlarges the robustness disparity among classes by applying a temperature factor to the confidence distribution of each image. Experiments demonstrate that our method achieves a higher attack rate than the PGD attack. Furthermore, from the defense perspective, we also modify the training and inference phases to improve the robustness of the most vulnerable class, so as to mitigate the large differences in class-wise robustness. We believe our work can contribute to a more comprehensive understanding of adversarial training and prompt a rethinking of class-wise properties in robust models.
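
To make the two central notions concrete, the following is a minimal illustrative sketch, not the implementation used in the paper. It shows (a) how class-wise accuracy can be computed from labels and predictions, and (b) an L-infinity PGD loop in which the logits are divided by a temperature before the softmax cross-entropy loss. The abstract does not specify where the temperature enters the loss, so this placement is an assumption, as are the toy linear classifier and all names and default values (`classwise_accuracy`, `temperature_pgd`, `eps`, `step`, `iters`).

```python
import math
from collections import defaultdict


def classwise_accuracy(labels, predictions):
    """Return {class: fraction of that class's examples predicted correctly}."""
    correct, total = defaultdict(int), defaultdict(int)
    for y, p in zip(labels, predictions):
        total[y] += 1
        correct[y] += int(y == p)
    return {c: correct[c] / total[c] for c in total}


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


def linear_logits(W, b, x):
    """Logits of a toy linear classifier (hypothetical stand-in for a network)."""
    return [sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)]


def temperature_pgd(W, b, x, y, eps=0.3, step=0.1, iters=10, temperature=1.0):
    """L-infinity PGD on the toy linear model with a temperature-scaled loss.

    Assumption: the temperature divides the logits inside the cross-entropy
    loss. With temperature=1.0 this reduces to a plain PGD loop.
    """
    x_adv = list(x)
    for _ in range(iters):
        probs = softmax(linear_logits(W, b, x_adv), temperature)
        # Gradient of cross-entropy w.r.t. each logit: (p_k - 1[k == y]) / T.
        dlogits = [(p - float(k == y)) / temperature for k, p in enumerate(probs)]
        # Chain rule through the linear layer: grad_x = W^T dlogits.
        grad = [sum(dlogits[k] * W[k][j] for k in range(len(W)))
                for j in range(len(x))]
        # Signed gradient ascent step on the loss.
        x_adv = [xa + step * ((g > 0) - (g < 0)) for xa, g in zip(x_adv, grad)]
        # Project back into the eps-ball around the clean input.
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv


if __name__ == "__main__":
    W, b = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
    x_adv = temperature_pgd(W, b, x=[1.0, 0.5], y=0, temperature=0.5)
    print(x_adv, linear_logits(W, b, x_adv))
    print(classwise_accuracy([0, 0, 1, 1, 1], [0, 1, 1, 1, 0]))
```

Applying `classwise_accuracy` to predictions on adversarial inputs gives a per-class robust accuracy, which is the kind of quantity whose spread across classes the abstract describes as unbalanced. The sketch omits clipping to a valid pixel range and random initialization, both of which a practical attack on images would normally include.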