Minimal adversarial perturbations added to inputs have been shown to be effective at fooling deep neural networks. In this paper, we introduce several innovations that align white-box targeted attacks with the intuition behind the attacker's goal: to trick the model into assigning a higher probability to the target class than to any other class, while staying within a specified distance of the original input. First, we propose a new loss function that explicitly captures this goal, in particular by using the logits of all classes rather than just a subset, as is common. We show that Auto-PGD finds more adversarial examples with this loss function than with other commonly used loss functions. Second, we propose a new attack method that uses a further-developed version of our loss function, capturing both the misclassification objective and the $L_{\infty}$ distance limit $\epsilon$. This new attack method achieves a 1.5--4.2% relative improvement in success rate on the CIFAR10 dataset and an 8.2--14.9% relative improvement on the ImageNet dataset over the next best state-of-the-art attack. Using statistical tests, we confirm that our attack outperforms state-of-the-art attacks across different datasets, values of $\epsilon$, and defenses.
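As a minimal illustration of the stated goal (the target class receiving a higher logit than every other class while the perturbation stays within the $L_{\infty}$ budget), one natural margin-style objective over the logits of all classes is sketched below; the symbols here are introduced purely for illustration, and this is an assumed formulation rather than necessarily the exact loss proposed in the paper:
$$
\max_{\delta \,:\, \|\delta\|_{\infty} \le \epsilon} \; z_t(x+\delta) \;-\; \max_{i \neq t} z_i(x+\delta),
$$
where $x$ is the original input, $z_i(\cdot)$ denotes the logit of class $i$, and $t$ is the target class; the targeted misclassification succeeds as soon as this margin becomes positive.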