White-box Adversarial Example (AE) attacks on Deep Neural Networks (DNNs) are more destructive than black-box AE attacks. However, almost all white-box approaches lack interpretability from the perspective of the DNN itself: adversaries rarely design attacks around interpretable features, and few approaches consider what features the DNN actually learns. In this paper, we propose an interpretable white-box AE attack, DI-AA, which applies deep Taylor decomposition, an interpretability technique, to select the most contributing features, and adopts a Lagrangian relaxation that jointly optimizes the logit output and the L_p norm to further reduce the perturbation. We compare DI-AA with six baseline attacks (including the state-of-the-art attack AutoAttack) on three datasets. Experimental results reveal that our proposed approach can 1) attack non-robust models with comparatively low perturbation, at or below the level of AutoAttack; 2) break TRADES adversarial training models with the highest success rate; and 3) generate AEs that, in black-box transfer attacks, reduce the robust accuracy of robust black-box models by 16% to 31%.
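To make the two ingredients of the abstract concrete, the following is a minimal PyTorch sketch, not the authors' implementation: it selects a fraction of the most contributing input features and then runs a Lagrangian-relaxed optimization over the logit margin and the L_2 norm of the perturbation (a C&W-style relaxation). The saliency step uses plain gradient-times-input as a stand-in for full deep Taylor decomposition, and the names and hyperparameters (`di_aa_sketch`, `c`, `kappa`, `topk_frac`) are illustrative assumptions.

```python
import torch


def di_aa_sketch(model, x, y, c=1.0, kappa=0.0, lr=0.01, steps=100, topk_frac=0.1):
    """Illustrative sketch (single image, batch size 1): perturb only the
    most contributing features, minimizing an L2 penalty plus a
    Lagrangian-relaxed logit-margin loss. Gradient*input saliency is a
    stand-in for deep Taylor decomposition; hyperparameters are assumptions."""
    x = x.clone().detach()

    # --- Feature selection: gradient*input saliency as a DTD stand-in ---
    x.requires_grad_(True)
    logits = model(x)
    logits[0, y].backward()
    saliency = (x.grad * x).abs().flatten()
    k = max(1, int(topk_frac * saliency.numel()))
    mask = torch.zeros_like(saliency)
    mask[saliency.topk(k).indices] = 1.0  # keep only top-k contributing features
    mask = mask.view_as(x)
    x = x.detach()

    # --- Lagrangian relaxation: L2 norm + c * logit margin, masked perturbation ---
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        adv = (x + delta * mask).clamp(0, 1)
        z = model(adv)[0]
        other = z.clone()
        other[y] = -float("inf")
        # margin > -kappa means the true class still (nearly) wins
        margin = torch.clamp(z[y] - other.max(), min=-kappa)
        loss = (delta * mask).pow(2).sum() + c * margin
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (x + delta.detach() * mask).clamp(0, 1)
```

In this relaxation, the constant `c` trades off perturbation size against misclassification confidence; larger `c` yields more reliable attacks at the cost of a larger L_p norm, which is why the paper's formulation tunes it to keep perturbations low.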