Adversarial training algorithms have proven to be a reliable way to improve machine learning models' robustness against adversarial examples. However, we find that adversarial training algorithms tend to introduce a severe disparity in accuracy and robustness between different groups of data. For instance, a PGD adversarially trained ResNet18 model on CIFAR-10 has 93% clean accuracy and 67% robust accuracy under l_inf PGD attacks with budget 8/255 on the class "automobile", but only 65% and 17% on the class "cat". This phenomenon arises even on balanced datasets and does not occur in naturally trained models that only use clean samples. In this work, we show both empirically and theoretically that this phenomenon can arise under general adversarial training algorithms that minimize DNN models' robust errors. Motivated by these findings, we propose a Fair-Robust-Learning (FRL) framework to mitigate this unfairness problem when performing adversarial defense. Experimental results validate the effectiveness of FRL.
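To make the reported per-class disparity concrete, the following is a minimal PyTorch sketch (not the authors' code) of how one could measure per-class clean and robust accuracy of a trained classifier under an untargeted l_inf PGD attack. The helper names (pgd_attack, per_class_accuracy), the assumption of a CIFAR-10-style test loader with inputs in [0, 1], and the hyperparameters (eps = 8/255, step = 2/255, 10 iterations) are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F


def pgd_attack(model, x, y, eps=8 / 255, step=2 / 255, iters=10):
    """Untargeted l_inf PGD starting from a random point in the eps-ball."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Gradient-sign step, then project back onto the eps-ball and [0, 1].
        x_adv = x_adv.detach() + step * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()


def per_class_accuracy(model, loader, num_classes=10, device="cuda"):
    """Per-class clean accuracy and robust accuracy under the PGD attack above."""
    model.eval()
    clean_correct = torch.zeros(num_classes)
    robust_correct = torch.zeros(num_classes)
    total = torch.zeros(num_classes)
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        with torch.no_grad():
            clean_pred = model(x).argmax(dim=1)
        x_adv = pgd_attack(model, x, y)  # needs gradients w.r.t. the input
        with torch.no_grad():
            robust_pred = model(x_adv).argmax(dim=1)
        for c in range(num_classes):
            mask = y == c
            total[c] += mask.sum().item()
            clean_correct[c] += (clean_pred[mask] == c).sum().item()
            robust_correct[c] += (robust_pred[mask] == c).sum().item()
    return clean_correct / total, robust_correct / total
```

Comparing the two returned vectors class by class (e.g., "automobile" vs. "cat" on CIFAR-10) exposes the kind of accuracy and robustness gap described above.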