Adversarial attacks modify images with perturbations that change a classifier's prediction. These modified images, known as adversarial examples, expose the vulnerabilities of deep neural network classifiers. In this paper, we investigate the predictability of the mapping between the classes predicted for original images and the classes predicted for their corresponding adversarial examples. This predictability relates to the possibility of retrieving the original predictions and hence reversing the induced misclassification. We refer to this property as the reversibility of an adversarial attack, and quantify reversibility as the accuracy of retrieving the original class or the true class of an adversarial example. We present an approach that reverses the effect of an adversarial attack on a classifier using a prior set of classification results. We analyse the reversibility of state-of-the-art adversarial attacks on benchmark classifiers and discuss the factors that affect reversibility.
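The following is a minimal sketch, not the authors' implementation, of how such a reversal and its quantification could look if the "prior set of classification results" is taken to be pairs of (original prediction, adversarial prediction). The function names fit_reverse_mapping and reversibility, and the toy class indices, are hypothetical and only illustrate the idea of mapping each adversarially predicted class back to the original class it most often arose from, then measuring the accuracy of that retrieval.

```python
# Sketch: learn a class-to-class reverse mapping from prior prediction pairs,
# then score reversibility as the accuracy of retrieving the original class.
from collections import Counter, defaultdict

def fit_reverse_mapping(prior_pairs):
    """prior_pairs: iterable of (original_class, adversarial_class) predictions."""
    votes = defaultdict(Counter)
    for orig_cls, adv_cls in prior_pairs:
        votes[adv_cls][orig_cls] += 1
    # Map each adversarial class to the original class it most frequently came from.
    return {adv: counts.most_common(1)[0][0] for adv, counts in votes.items()}

def reversibility(mapping, test_pairs):
    """Accuracy of retrieving the original class from the adversarial prediction."""
    hits = sum(mapping.get(adv) == orig for orig, adv in test_pairs)
    return hits / len(test_pairs) if test_pairs else 0.0

# Hypothetical usage with toy class indices:
prior = [(0, 3), (0, 3), (1, 3), (2, 5), (2, 5)]
test = [(0, 3), (2, 5), (1, 3)]
mapping = fit_reverse_mapping(prior)
print(reversibility(mapping, test))  # 2/3 in this toy example
```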