Detecting Backdoor in Deep Neural Networks via Intentional Adversarial Perturbations

Recent research shows that deep learning models are susceptible to backdoor attacks, in which a backdoor embedded in the model is triggered when a backdoor instance arrives. This paper proposes a novel backdoor detection method based on adversarial examples. The method uses intentional adversarial perturbations to detect whether an image contains a trigger, and it can be applied in two scenarios: sanitizing the training set in the training stage and detecting backdoor instances in the inference stage. Specifically, given an untrusted image, an adversarial perturbation is intentionally added to it. If the model's prediction on the perturbed image is consistent with its prediction on the unperturbed image, the input image is considered a backdoor instance. The method requires few computational resources and, because the added perturbation is very small, preserves the visual quality of the images. Experimental results show that the proposed defense reduces the backdoor attack success rates from 99.47%, 99.77% and 97.89% to 0.37%, 0.24% and 0.09% on the Fashion-MNIST, CIFAR-10 and GTSRB datasets, respectively. In addition, for attacks under different settings (trigger transparency, trigger size and trigger pattern), the false acceptance rates of the proposed method are as low as 1.2%, 0.3% and 0.04% on Fashion-MNIST, CIFAR-10 and GTSRB, respectively, which indicates that the method maintains high defense performance across these attack settings.
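The decision rule in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the paper's implementation: the toy linear classifier `W`, the hand-picked perturbation `delta`, and the helper names `predict` and `is_backdoor_instance` are all hypothetical, and the abstract does not say how the perturbation is actually generated. The sketch only shows the consistency check: perturb the input, compare predictions, and flag the input if the prediction does not change.

```python
import numpy as np

# Toy stand-in for a backdoored classifier (an assumption for illustration):
# a linear model whose class 2 reacts strongly to feature 3, the "trigger".
W = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 10.0],
])

def predict(x):
    """Return the predicted class index for a flattened input in [0, 1]."""
    return int(np.argmax(W @ x))

def is_backdoor_instance(predict_fn, x, perturbation):
    """Flag x when the prediction is unchanged by the perturbation."""
    perturbed = np.clip(x + perturbation, 0.0, 1.0)  # keep valid pixel range
    return predict_fn(perturbed) == predict_fn(x)

# Hand-picked perturbation that pushes a clean input from class 0 to class 1.
delta = np.array([-0.2, 0.2, 0.0, 0.0])

clean = np.array([0.6, 0.5, 0.1, 0.0])      # no trigger present
triggered = np.array([0.6, 0.5, 0.1, 1.0])  # same input with the trigger set

flagged_clean = is_backdoor_instance(predict, clean, delta)
flagged_triggered = is_backdoor_instance(predict, triggered, delta)
print(flagged_clean, flagged_triggered)
```

In this toy setup the clean input changes class under the perturbation and is accepted, while the triggered input keeps its prediction because the trigger feature dominates the logits, so it is flagged. A real detector would need a perturbation strong enough to flip clean inputs but too weak to override the trigger.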


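Related content

Fashion-MNIST (dataset)

Fashion-MNIST is an image dataset intended as a replacement for the MNIST handwritten digit set. It is provided by the research division of Zalando, a German fashion technology company. It contains front-view images of 70,000 different products from 10 categories. Its size, format and training/test split are identical to those of the original MNIST: 60,000 training and 10,000 test images, each a 28x28 grayscale picture. It can therefore be used directly to benchmark machine learning and deep learning algorithms without changing any code written for MNIST.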
