Deep learning models are vulnerable to adversarial examples and make incomprehensible mistakes, which threatens their real-world deployment. Combined with the idea of adversarial training, preprocessing-based defenses are popular and convenient to use because of their task independence and good generalizability. Current defense methods, especially purification, tend to remove ``noise'' by learning to recover the natural images. However, unlike random noise, adversarial patterns are much more easily overfitted during model training because of their strong correlation with the images. In this work, we propose a novel adversarial purification scheme that disentangles natural images from adversarial perturbations as a preprocessing defense. Extensive experiments show that our defense generalizes well and provides significant protection against unseen strong adversarial attacks. It reduces the success rates of state-of-the-art \textbf{ensemble} attacks from \textbf{61.7\%} to \textbf{14.9\%} on average, outperforming a number of existing methods. Notably, our defense restores the perturbed images perfectly and does not hurt the clean accuracy of backbone models, which is highly desirable in practice.