Adversarial defenses train deep neural networks to be invariant to the input perturbations produced by adversarial attacks. Almost all defense strategies achieve this invariance through adversarial training, i.e., training on inputs with adversarial perturbations. Although adversarial training is successful at mitigating adversarial attacks, the behavioral differences between adversarially-trained (AT) models and standard models are still poorly understood. Motivated by a recent study on learning robustness without input perturbations by distilling an AT model, we explore what is learned during adversarial training by analyzing the distribution of logits in AT models. We identify three logit characteristics essential to learning adversarial robustness. First, we provide a theoretical justification for the finding that adversarial training shrinks two important characteristics of the logit distribution: the max logit values and the "logit gaps" (the difference between the max logit and the next-largest logit) are on average lower for AT models. Second, we show that AT and standard models differ significantly on which samples are high or low confidence, then illustrate clear qualitative differences by visualizing samples with the largest confidence difference. Finally, by manipulating the non-max logit information during distillation and measuring the impact on the student's robustness, we find that learning information about incorrect classes is essential to learning robustness. Our results indicate that learning some adversarial robustness without input perturbations requires a model to learn specific sample-wise confidences and incorrect class orderings that follow complex distributions.
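As a concrete illustration of the two distributional characteristics in the first finding, the statistics can be measured directly from a model's outputs. The following is a minimal PyTorch sketch, not taken from the paper; `at_model`, `std_model`, and `inputs` are hypothetical placeholders standing in for an adversarially-trained model, a standard model, and a batch of test inputs.

```python
import torch

def logit_statistics(logits: torch.Tensor):
    """Return per-sample max logits and logit gaps.

    logits: tensor of shape (batch_size, num_classes).
    The "logit gap" is the difference between the largest and
    second-largest logit for each sample.
    """
    top2 = logits.topk(k=2, dim=1).values  # two largest logits per sample
    max_logit = top2[:, 0]                 # largest logit
    logit_gap = top2[:, 0] - top2[:, 1]    # max minus runner-up
    return max_logit, logit_gap

# Hypothetical usage: compare average statistics of an AT model and a
# standard model on the same batch; the paper's finding predicts both
# means are lower for the AT model.
# with torch.no_grad():
#     at_max, at_gap = logit_statistics(at_model(inputs))
#     std_max, std_gap = logit_statistics(std_model(inputs))
#     print(at_max.mean(), at_gap.mean(), std_max.mean(), std_gap.mean())
```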