It is well understood that modern deep networks are vulnerable to adversarial attacks. However, conventional attack methods fail to produce adversarial perturbations that are intelligible to humans, and they pose limited threats in the physical world. To study feature-class associations in networks and better understand their vulnerability to attacks in the real world, we develop feature-level adversarial perturbations using deep image generators and a novel optimization objective. We term these feature-fool attacks. We show that they are versatile and use them to generate targeted feature-level attacks at the ImageNet scale that are simultaneously interpretable, universal to any source image, and physically realizable. These attacks reveal spurious, semantically describable feature-class associations that can be exploited by novel combinations of objects. We use them to guide the design of "copy/paste" adversaries in which one natural image is pasted into another to cause a targeted misclassification. Code is available at https://github.com/thestephencasper/feature_fool.
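The sketch below illustrates the core idea of a feature-level attack under stated assumptions: rather than perturbing pixels directly, we optimize the latent input of an image generator so that the generated patch, pasted into source images, drives a fixed classifier toward a chosen target class. The `TinyGenerator` stand-in, the patch-pasting transform, the target class, and all hyperparameters are illustrative placeholders rather than the paper's exact architecture or objective; in practice a pretrained deep generator and a more carefully designed objective would be used.

```python
# Minimal sketch of a feature-level ("feature-fool") attack, under the assumptions
# stated above: optimize the latent input of an image generator so the generated
# patch, pasted into natural source images, causes a targeted misclassification.
# TinyGenerator, patch size, target class, and hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50, ResNet50_Weights

device = "cuda" if torch.cuda.is_available() else "cpu"


class TinyGenerator(nn.Module):
    """Placeholder for a pretrained deep image generator (e.g., a GAN generator)."""

    def __init__(self, latent_dim: int = 128, patch_size: int = 64):
        super().__init__()
        self.patch_size = patch_size
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 3 * patch_size * patch_size),
            nn.Sigmoid(),  # keep generated pixels in [0, 1]
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z).view(-1, 3, self.patch_size, self.patch_size)


def paste_patch(images: torch.Tensor, patch: torch.Tensor) -> torch.Tensor:
    """Paste the patch at a random location in each image (a crude robustness transform)."""
    out = images.clone()
    ps = patch.shape[-1]
    _, _, h, w = images.shape
    for i in range(images.shape[0]):
        y = torch.randint(0, h - ps + 1, (1,)).item()
        x = torch.randint(0, w - ps + 1, (1,)).item()
        out[i, :, y:y + ps, x:x + ps] = patch[0]
    return out


# Frozen ImageNet classifier under attack (input normalization omitted for brevity).
classifier = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1).to(device).eval()
for p in classifier.parameters():
    p.requires_grad_(False)

generator = TinyGenerator().to(device)
z = torch.randn(1, 128, device=device, requires_grad=True)
optimizer = torch.optim.Adam([z], lr=0.05)

target_class = 309  # assumed ImageNet target index, purely illustrative
# Stand-in for a batch of natural source images; the attack aims to be universal across them.
source_images = torch.rand(8, 3, 224, 224, device=device)
target = torch.full((source_images.shape[0],), target_class, dtype=torch.long, device=device)

for step in range(200):
    patch = generator(z)                      # feature-level perturbation from the generator
    adv = paste_patch(source_images, patch)   # patch insertion into every source image
    loss = F.cross_entropy(classifier(adv), target)  # targeted objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because the perturbation lives in the generator's feature space rather than in raw pixels, the resulting patch tends to contain human-recognizable structure, which is what makes such attacks useful for surfacing spurious feature-class associations and for guiding copy/paste adversaries.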