Deep neural networks (DNNs) are vulnerable to adversarial examples that trigger misclassification while remaining imperceptible to human perception. Adversarial attacks have become an important means of evaluating the robustness of DNNs. Existing attack methods typically construct adversarial examples by using an $\ell_p$ distance as the similarity metric when perturbing samples. However, this kind of metric is incompatible with the underlying real-world image formation process and with human visual perception. In this paper, we first propose an internal Wasserstein distance (IWD) to measure the image similarity between a sample and its adversarial example, and we apply IWD to both adversarial attack and defense. Specifically, we develop a novel attack method that captures the distribution of patches in the original samples. As a result, our approach is able to generate semantically similar yet diverse adversarial examples that are more difficult for existing defense methods to resist. Relying on IWD, we also build a new defense method that seeks to learn robust models capable of defending against unseen adversarial examples. We provide thorough theoretical and empirical evidence to support our methods.
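To make the core idea concrete, the sketch below illustrates one plausible way to compute a patch-level Wasserstein distance between a sample and its adversarial counterpart. The patch size, stride, squared-Euclidean cost, and assignment-based exact optimal transport are illustrative assumptions for this sketch, not the paper's actual IWD formulation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def extract_patches(img, patch_size=4, stride=4):
    """Collect non-overlapping patches of an (H, W, C) image as flat vectors."""
    H, W = img.shape[:2]
    patches = []
    for i in range(0, H - patch_size + 1, stride):
        for j in range(0, W - patch_size + 1, stride):
            patches.append(img[i:i + patch_size, j:j + patch_size].ravel())
    return np.stack(patches)

def patch_wasserstein_distance(x, x_adv, patch_size=4, stride=4):
    """Exact OT cost between two uniform empirical patch distributions.

    With equal numbers of equally weighted patches, optimal transport
    reduces to a minimum-cost assignment problem (illustrative proxy for IWD).
    """
    P = extract_patches(x, patch_size, stride)
    Q = extract_patches(x_adv, patch_size, stride)
    # Pairwise squared Euclidean cost between patches of the two images.
    cost = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()

# Usage: compare a clean image with a slightly perturbed counterpart.
x = np.random.rand(32, 32, 3).astype(np.float32)
x_adv = np.clip(x + 0.03 * np.random.randn(*x.shape).astype(np.float32), 0, 1)
print(patch_wasserstein_distance(x, x_adv))
```

Because the distance is computed over patch distributions rather than pixel-wise differences, perturbations that rearrange or re-synthesize local structure can stay close under this metric while being large in $\ell_p$ norm, which is the intuition behind generating semantically similar but diverse adversarial examples.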