This paper provides a unified view for explaining different adversarial attacks and defense methods: the view of multi-order interactions between the input variables of DNNs. Based on multi-order interactions, we find that adversarial attacks mainly affect high-order interactions to fool the DNN. Furthermore, we find that the robustness of adversarially trained DNNs comes from category-specific low-order interactions. Our findings offer a potential way to unify adversarial perturbations and robustness, and can explain existing defense methods in a principled way. In addition, our findings revise the previous, inaccurate understanding of the shape bias of adversarially learned features.
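The abstract does not define the multi-order interaction, so the following is only an illustrative sketch. It assumes the game-theoretic definition commonly used in this line of work: the order-m interaction between variables i and j is the average, over all contexts S of m other variables, of f(S ∪ {i, j}) − f(S ∪ {i}) − f(S ∪ {j}) + f(S), where f(S) is the model output when only the variables in S are present. The function name `multi_order_interaction` and the two toy models are hypothetical and are not taken from the paper.

```python
from itertools import combinations
from statistics import mean


def multi_order_interaction(f, n, i, j, m):
    """Order-m interaction between variables i and j (assumed definition).

    f maps a frozenset of present variable indices to a scalar output.
    n is the number of input variables; m must lie in 0..n-2.
    The average is taken exactly, over every context S of size m
    drawn from the variables other than i and j.
    """
    others = [k for k in range(n) if k not in (i, j)]
    deltas = []
    for context in combinations(others, m):
        S = frozenset(context)
        # Joint effect of i and j beyond their individual effects, given S.
        deltas.append(f(S | {i, j}) - f(S | {i}) - f(S | {j}) + f(S))
    return mean(deltas)


# Hypothetical toy model: responds only when all four variables are present,
# so variables 0 and 1 interact only in the largest context (high order).
def toy_and(S):
    return 1.0 if len(S) == 4 else 0.0


# Hypothetical toy model: responds whenever variables 0 and 1 are both
# present, regardless of context, so the interaction is the same at every order.
def toy_pair(S):
    return 1.0 if {0, 1} <= S else 0.0


high = [multi_order_interaction(toy_and, 4, 0, 1, m) for m in range(3)]
flat = [multi_order_interaction(toy_pair, 4, 0, 1, m) for m in range(3)]
print(high)  # [0.0, 0.0, 1.0]
print(flat)  # [1.0, 1.0, 1.0]
```

Exact enumeration is only feasible for a handful of variables; for real image inputs the contexts S would have to be sampled, and the way absent variables are masked is a further assumption that this sketch leaves to the caller through `f`.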