We propose a novel method for creating adversarial examples. Instead of perturbing pixels, we use an encoder-decoder representation of the input image and perturb intermediate layers in the decoder. This changes the high-level features provided by the generative model. As a result, our perturbations are semantically meaningful, producing changes such as a longer beak or a greener tint. We formulate the task as an optimization problem: minimize the Wasserstein distance between the adversarial and initial images under a misclassification constraint. We solve it with a projected gradient method that uses a simple inexact projection. Because the projection keeps every iterate feasible, our method always produces an adversarial image. We perform numerical experiments on the MNIST and ImageNet datasets in both targeted and untargeted settings. We demonstrate that our adversarial images are much less vulnerable to steganographic defence techniques than pixel-based attacks. Moreover, we show that our method modifies key features such as edges and that defence techniques based on adversarial training are vulnerable to our attacks.
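
The abstract describes a projected gradient method whose inexact projection pushes each iterate onto the misclassification set. Below is a minimal, hypothetical sketch of what such a loop could look like for a batch of one image. The names decoder_attack, decoder_tail, classifier and z_init are illustrative (z_init standing for the intermediate decoder representation of the original image), a plain L2 image distance is used as a stand-in for the Wasserstein objective, and a few cross-entropy gradient steps toward the target class serve as one simple realization of the inexact projection; none of this is the paper's exact implementation.

import torch
import torch.nn.functional as F

def decoder_attack(decoder_tail, classifier, z_init, x_orig, target,
                   steps=200, lr=0.05, proj_steps=5, proj_lr=0.1):
    # Perturb an intermediate decoder representation z so that the decoded
    # image stays close to x_orig while being classified as `target`.
    # Assumes a batch of one image; all module and variable names are
    # hypothetical placeholders, not the paper's code.
    z = z_init.clone().requires_grad_(True)
    for _ in range(steps):
        # Gradient step on the image-space distance
        # (plain L2 here, as a stand-in for the Wasserstein distance).
        dist = F.mse_loss(decoder_tail(z), x_orig)
        grad, = torch.autograd.grad(dist, z)
        with torch.no_grad():
            z -= lr * grad
        # Simple inexact projection: a few gradient steps on the classifier
        # loss that push the decoded image back into the target class,
        # keeping the iterate adversarial.
        for _ in range(proj_steps):
            logits = classifier(decoder_tail(z))
            if logits.argmax(dim=1).item() == target:
                break
            loss = F.cross_entropy(
                logits, torch.tensor([target], device=logits.device))
            g, = torch.autograd.grad(loss, z)
            with torch.no_grad():
                z -= proj_lr * g
    return decoder_tail(z).detach()

Because only z is optimized while the decoder weights stay fixed, the resulting change appears in the decoded image as a modification of high-level features rather than of individual pixels.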