Although fast adversarial training provides an efficient approach for building robust networks, it may suffer from a serious problem known as catastrophic overfitting (CO), in which multi-step robust accuracy suddenly collapses to zero. In this paper, we decouple, for the first time, single-step adversarial examples into data-information and self-information, which reveals an interesting phenomenon we call "self-fitting". Self-fitting, i.e., the network learning the self-information embedded in single-step perturbations, naturally leads to the occurrence of CO. When self-fitting occurs, the network exhibits a pronounced "channel differentiation" phenomenon: convolution channels responsible for recognizing self-information become dominant, while those responsible for data-information are suppressed. As a result, the network can only recognize images that contain sufficient self-information and loses its ability to generalize to other types of data. Based on self-fitting, we provide new insights into existing methods for mitigating CO and extend CO to multi-step adversarial training. Our findings reveal a self-learning mechanism in adversarial training and open up new perspectives for suppressing different kinds of information to mitigate CO.
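To make the data-/self-information decoupling concrete, below is a minimal, hypothetical sketch (not the authors' protocol) of one way such a probe could look in PyTorch. It assumes the single-step FGSM noise crafted by the training network itself carries the self-information, while noise crafted by an independently trained surrogate network stands in for data-information; comparing the model's accuracy under the two perturbations then hints at whether it has fit its own noise. The names `model`, `surrogate`, and `loader`, and the surrogate-based split itself, are illustrative assumptions.

```python
# Hypothetical probe of self-fitting (illustrative only, not the paper's method):
# compare robustness against FGSM noise from the model itself (self-information)
# vs. noise from an independently trained surrogate (data-information proxy).
import torch
import torch.nn.functional as F


def fgsm_perturbation(net, x, y, eps):
    """Single-step L-inf perturbation generated by `net`."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(net(x), y)
    grad, = torch.autograd.grad(loss, x)
    return eps * grad.sign()


@torch.no_grad()
def accuracy(net, x, y):
    return (net(x).argmax(dim=1) == y).float().mean().item()


def probe_self_fitting(model, surrogate, loader, eps=8 / 255, device="cpu"):
    """High accuracy on the model's own noise but low accuracy on the
    surrogate's noise would suggest the model is fitting self-information."""
    model.eval()
    surrogate.eval()
    acc_self, acc_data, n = 0.0, 0.0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        d_self = fgsm_perturbation(model, x, y, eps)      # self-information noise
        d_data = fgsm_perturbation(surrogate, x, y, eps)  # data-information proxy
        acc_self += accuracy(model, (x + d_self).clamp(0, 1), y) * x.size(0)
        acc_data += accuracy(model, (x + d_data).clamp(0, 1), y) * x.size(0)
        n += x.size(0)
    return acc_self / n, acc_data / n
```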