Adversarial purification refers to a class of defense methods that remove adversarial perturbations using a generative model. These methods make no assumptions about the form of the attack or the classification model, and can thus defend pre-existing classifiers against unseen threats. However, their performance currently lags behind that of adversarial training methods. In this work, we propose DiffPure, which uses diffusion models for adversarial purification: given an adversarial example, we first diffuse it with a small amount of noise following a forward diffusion process, and then recover the clean image through the reverse generative process. To evaluate our method against strong adaptive attacks in an efficient and scalable way, we propose using the adjoint method to compute full gradients of the reverse generative process. Extensive experiments on three image datasets (CIFAR-10, ImageNet, and CelebA-HQ) with three classifier architectures (ResNet, WideResNet, and ViT) demonstrate that our method achieves state-of-the-art results, outperforming current adversarial training and adversarial purification methods, often by a large margin. Project page: https://diffpure.github.io.
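The purification procedure described above (diffuse the input partway along the forward process, then denoise it back with the reverse process) can be sketched as follows. This is a minimal illustration assuming a DDPM-style discretization with a linear beta schedule; `score_fn` is a hypothetical stub standing in for a pretrained score network, which an actual defense would require, and `t_star` plays the role of the small diffusion amount mentioned in the abstract:

```python
# Minimal sketch of diffusion-based purification (not the authors' implementation).
# Assumptions: DDPM discretization, linear beta schedule, stub score model.
import numpy as np

def make_schedule(num_steps=1000, beta_min=1e-4, beta_max=0.02):
    """Linear beta schedule and cumulative alpha products, as in DDPM."""
    betas = np.linspace(beta_min, beta_max, num_steps)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def score_fn(x, t):
    """Hypothetical stand-in for a learned score network estimating
    grad_x log p_t(x). Here: the score of a standard Gaussian."""
    return -x

def purify(x_adv, t_star=100, num_steps=1000, seed=0):
    """Diffuse the adversarial input up to step t_star, then denoise to t=0."""
    rng = np.random.default_rng(seed)
    betas, alpha_bars = make_schedule(num_steps)

    # Forward diffusion in closed form:
    # x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps
    abar = alpha_bars[t_star - 1]
    x = np.sqrt(abar) * x_adv + np.sqrt(1.0 - abar) * rng.standard_normal(x_adv.shape)

    # Reverse (ancestral) sampling from t_star back down to 1
    for t in range(t_star, 0, -1):
        beta_t, abar_t = betas[t - 1], alpha_bars[t - 1]
        # Noise prediction recovered from the score: eps = -sqrt(1 - abar_t) * score
        eps_hat = -np.sqrt(1.0 - abar_t) * score_fn(x, t)
        x = (x - beta_t / np.sqrt(1.0 - abar_t) * eps_hat) / np.sqrt(1.0 - beta_t)
        if t > 1:  # no noise added at the final step
            x = x + np.sqrt(beta_t) * rng.standard_normal(x.shape)
    return x
```

The purified output `purify(x_adv)` would then be passed to the unmodified, pre-existing classifier. Computing attack gradients through the reverse loop above is memory-intensive, which is why the abstract proposes the adjoint method for full-gradient evaluation under adaptive attacks.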