Recent research shows that deep learning models are susceptible to backdoor attacks, in which a backdoor embedded in the model is triggered when a backdoor instance arrives. In this paper, a novel backdoor detection method based on adversarial examples is proposed. The proposed method leverages intentional adversarial perturbations to detect whether an image contains a trigger, and can be applied in two scenarios: sanitizing the training set in the training stage and detecting backdoor instances in the inference stage. Specifically, given an untrusted image, an adversarial perturbation is intentionally added to the input image; if the model's prediction on the perturbed image is consistent with its prediction on the unperturbed image, the input image is considered a backdoor instance. The proposed adversarial perturbation based method requires low computational resources and preserves the visual quality of the images, since the added perturbation is very small. Experimental results show that the proposed defense reduces the backdoor attack success rate from 99.47%, 99.77% and 97.89% to 0.37%, 0.24% and 0.09% on the Fashion-MNIST, CIFAR-10 and GTSRB datasets, respectively. In addition, for attacks under different settings (trigger transparency, trigger size and trigger pattern), the false acceptance rates of the proposed method are as low as 1.2%, 0.3% and 0.04% on Fashion-MNIST, CIFAR-10 and GTSRB, respectively, which demonstrates that the proposed method achieves high defense performance against backdoor attacks under a variety of attack settings.
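The perturbation-consistency test described above can be summarized in a minimal sketch, assuming a PyTorch image classifier. The function name detect_backdoor_instance, the FGSM-style perturbation, and the epsilon budget are illustrative assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def detect_backdoor_instance(model, image, epsilon=0.05):
    """Hypothetical sketch: flag an input as a backdoor instance if its
    prediction is unchanged under a small adversarial perturbation.
    `epsilon` is an assumed perturbation budget, not a value from the paper."""
    model.eval()
    image = image.clone().detach().requires_grad_(True)

    # Prediction on the unperturbed image.
    logits = model(image)
    pred_clean = logits.argmax(dim=1)

    # Craft a small adversarial perturbation against the clean prediction
    # (an FGSM-style step is used here purely for illustration).
    loss = F.cross_entropy(logits, pred_clean)
    loss.backward()
    perturbed = (image + epsilon * image.grad.sign()).clamp(0.0, 1.0)

    # Prediction on the perturbed image.
    with torch.no_grad():
        pred_adv = model(perturbed).argmax(dim=1)

    # A trigger tends to dominate the model's decision, so an unchanged
    # label under adversarial perturbation is treated as evidence of a
    # backdoor instance; a clean image would normally change its label.
    return bool((pred_adv == pred_clean).item())
```

In use, this test could sanitize a training set by filtering flagged samples before training, or screen inputs at inference time before accepting the model's prediction.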