The adversarial attack literature contains a myriad of algorithms for crafting perturbations which yield pathological behavior in neural networks. In many cases, multiple algorithms target the same tasks and even enforce the same constraints. In this work, we show that different attack algorithms produce adversarial examples which are distinct not only in their effectiveness but also in how they qualitatively affect their victims. We begin by demonstrating that one can determine the attack algorithm that crafted an adversarial example. Then, we leverage recent advances in parameter-space saliency maps to show, both visually and quantitatively, that adversarial attack algorithms differ in which parts of the network and image they target. Our findings suggest that prospective adversarial attacks should be compared not only via their success rates at fooling models but also via deeper downstream effects they have on victims.