Adversarial attacks hamper the decision-making ability of neural networks by perturbing the input signal. Adding small, carefully calculated distortions to an image, for instance, can deceive a well-trained image classification network. In this work, we propose a novel attack technique called the Sparse Adversarial and Interpretable Attack Framework (SAIF). Specifically, we design imperceptible attacks that contain low-magnitude perturbations at a small number of pixels, and leverage these sparse attacks to reveal the vulnerability of classifiers. We use the Frank-Wolfe (conditional gradient) algorithm to simultaneously optimize the attack perturbations under bounds on both magnitude and sparsity, with an $O(1/\sqrt{T})$ convergence rate. Empirical results show that SAIF computes highly imperceptible and interpretable adversarial examples, and outperforms state-of-the-art sparse attack methods on the ImageNet dataset.
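To make the optimization concrete, below is a minimal sketch (not the authors' code) of one Frank-Wolfe step for a sparse, magnitude-bounded perturbation. It assumes the perturbation is factored into an elementwise product of a noise tensor constrained to an $\ell_\infty$ ball of radius `eps` and a mask allowed at most `k` active pixels; `model` and `loss_fn` are hypothetical placeholders for the target classifier and its loss.

```python
# Sketch of one Frank-Wolfe (conditional gradient) step for a sparse,
# magnitude-bounded adversarial attack. Assumptions (not from the paper
# text): perturbation = mask * noise, ||noise||_inf <= eps, mask in
# [0,1]^d with at most k active entries.
import torch

def frank_wolfe_step(model, loss_fn, x, y, noise, mask, eps, k, t):
    """One step of Frank-Wolfe ascent at (0-indexed) iteration t."""
    noise = noise.detach().requires_grad_(True)
    mask = mask.detach().requires_grad_(True)
    loss = loss_fn(model(x + mask * noise), y)
    g_noise, g_mask = torch.autograd.grad(loss, (noise, mask))

    # Linear maximization oracle over the L_inf ball: the maximizing
    # vertex is eps * sign(grad) (ascent, since an attack maximizes loss).
    v_noise = eps * g_noise.sign()

    # LMO over {m in [0,1]^d : sum(m) <= k}: place 1s on the (at most) k
    # pixels with the largest positive gradient, 0 elsewhere.
    flat = g_mask.flatten()
    v_mask = torch.zeros_like(flat)
    topk = flat.topk(k).indices
    v_mask[topk] = (flat[topk] > 0).float()
    v_mask = v_mask.view_as(mask)

    # Standard Frank-Wolfe step size; moving along a convex combination
    # keeps both iterates feasible without any projection.
    gamma = 2.0 / (t + 2.0)
    noise = (1 - gamma) * noise + gamma * v_noise
    mask = (1 - gamma) * mask + gamma * v_mask
    return noise.detach(), mask.detach()
```

Because each update is a convex combination of the current iterate and a vertex of the constraint set, the iterates remain feasible by construction; this projection-free property is the usual motivation for choosing Frank-Wolfe over projected gradient methods for coupled magnitude and sparsity constraints.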