Adversarial training is a popular method to robustify models against adversarial attacks. However, it exhibits much more severe overfitting than training on clean inputs. In this work, we investigate this phenomenon from the perspective of training instances, i.e., training input-target pairs. Based on a quantitative metric measuring the difficulty of instances, we analyze the model's behavior on training instances of different difficulty levels. This lets us show that the decay in generalization performance of adversarial training is a result of the model's attempt to fit hard adversarial instances. We theoretically verify our observations for both linear and general nonlinear models, proving that models trained on hard instances have worse generalization performance than ones trained on easy instances. Furthermore, we prove that the difference in the generalization gap between models trained on instances of different difficulty levels increases with the size of the adversarial budget. Finally, we conduct case studies on methods mitigating adversarial overfitting in several scenarios. Our analysis shows that methods successfully mitigating adversarial overfitting all avoid fitting hard adversarial instances, while those that fit hard adversarial instances do not achieve true robustness.