Deep neural networks are vulnerable to adversarial attacks. Among the many defense strategies, adversarial training with untargeted attacks is one of the most widely recognized methods. In theory, the predicted labels of untargeted attacks should be unpredictable and uniformly distributed over all false classes. In practice, however, we find that the naturally imbalanced inter-class semantic similarity makes hard-class pairs become virtual targets of each other. This study investigates the impact of such closely coupled classes on adversarial attacks and accordingly develops a self-paced reweighting strategy for adversarial training. Specifically, we propose to upweight the loss of hard-class pairs during model optimization, which prompts the model to learn discriminative features for hard classes. We further incorporate a term that quantifies hard-class pair consistency in adversarial training, which greatly boosts model robustness. Extensive experiments show that the proposed adversarial training method achieves superior robustness over state-of-the-art defenses against a wide range of adversarial attacks.
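To make the reweighting idea concrete, the PyTorch sketch below upweights the adversarial loss of examples whose attack lands in a hard-class pair and adds a clean/adversarial consistency penalty on those same examples. The abstract gives no formulas, so the `hard_pairs` set, the `hard_weight` value, the KL-based consistency term, and the `beta` trade-off are illustrative assumptions rather than the authors' exact method.

```python
import torch
import torch.nn.functional as F


def pgd_attack(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Standard untargeted PGD attack in the L-infinity ball around x."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        # Project back into the eps-ball and the valid pixel range.
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()


def reweighted_at_loss(model, x, y, hard_pairs, hard_weight=2.0, beta=1.0):
    """Adversarial training loss that upweights hard-class-pair examples.

    hard_pairs: assumed set of (true_class, attacked_class) index pairs
    regarded as closely coupled; hard_weight and beta are assumed
    hyperparameters, not values from the paper.
    """
    x_adv = pgd_attack(model, x, y)
    logits_clean, logits_adv = model(x), model(x_adv)
    pred_adv = logits_adv.argmax(dim=1)

    # Mark examples whose adversarial prediction forms a hard-class pair
    # with the true label (the "virtual target" behavior in the abstract).
    is_hard = torch.tensor(
        [(int(t), int(p)) in hard_pairs for t, p in zip(y, pred_adv)],
        device=x.device,
    )

    # Upweight the adversarial cross-entropy on hard-class-pair examples.
    per_example = F.cross_entropy(logits_adv, y, reduction="none")
    weights = 1.0 + (hard_weight - 1.0) * is_hard.float()
    ce = (weights * per_example).mean()

    # Illustrative consistency term: align clean and adversarial output
    # distributions on the hard-class-pair examples only.
    kl = F.kl_div(
        F.log_softmax(logits_adv, dim=1),
        F.softmax(logits_clean, dim=1),
        reduction="none",
    ).sum(dim=1)
    consistency = (kl * is_hard.float()).mean()
    return ce + beta * consistency
```

In a training loop, `reweighted_at_loss` would stand in for the plain cross-entropy on adversarial examples; a self-paced variant could, for instance, gradually grow `hard_weight` or re-estimate `hard_pairs` from the model's confusion matrix as training progresses.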