NIP: Neuron-level Inverse Perturbation Against Adversarial Attacks

Although deep learning models have achieved unprecedented success, their vulnerability to adversarial attacks has attracted increasing attention, especially when they are deployed in security-critical domains. To address this challenge, numerous defense strategies, both reactive and proactive, have been proposed to improve robustness. Viewed from the image feature space, some of these defenses fail to reach satisfactory results because of feature shift. Moreover, the features learned by a model are not directly related to its classification results. In contrast, we approach defense from inside the model and investigate neuron behavior before and after attacks. We observe that attacks mislead the model by dramatically changing the neurons that contribute most and least to the correct label. Motivated by this observation, we introduce the concept of neuron influence and divide neurons into front, middle, and tail parts. Building on this, we propose neuron-level inverse perturbation (NIP), the first neuron-level reactive defense method against adversarial attacks. By strengthening the front neurons and weakening those in the tail part, NIP can eliminate nearly all adversarial perturbations while still maintaining high benign accuracy. It also adapts to perturbations of different sizes, especially larger ones. Comprehensive experiments on three datasets and six models show that NIP outperforms state-of-the-art baselines against eleven adversarial attacks. We further provide interpretable evidence via neuron activation and visualization for better understanding.
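
To make the front/middle/tail idea concrete, here is a minimal toy sketch. It is not the authors' implementation. The abstract does not say how neuron influence is computed or how the inverse perturbation is applied, so the sketch assumes a single linear layer, scores each neuron as activation times its weight toward the correct class, and rescales activations directly. The function names and the parameters `front_frac`, `tail_frac`, and `strength` are hypothetical.

```python
from typing import Dict, List, Sequence


def neuron_influence(activations: Sequence[float],
                     class_weights: Sequence[float]) -> List[float]:
    """Assumed influence score: each neuron's contribution to the correct-class logit."""
    return [a * w for a, w in zip(activations, class_weights)]


def split_neurons(influence: Sequence[float],
                  front_frac: float = 0.2,
                  tail_frac: float = 0.2) -> Dict[str, List[int]]:
    """Rank neurons by influence (highest first) and split the indices into three parts."""
    order = sorted(range(len(influence)), key=lambda i: influence[i], reverse=True)
    n_front = max(1, int(len(order) * front_frac))
    n_tail = max(1, int(len(order) * tail_frac))
    return {
        "front": order[:n_front],
        "middle": order[n_front:len(order) - n_tail],
        "tail": order[len(order) - n_tail:],
    }


def rescale(activations: Sequence[float],
            groups: Dict[str, List[int]],
            strength: float = 0.5) -> List[float]:
    """Strengthen front neurons and weaken tail neurons; middle neurons are untouched."""
    out = list(activations)
    for i in groups["front"]:
        out[i] *= 1.0 + strength
    for i in groups["tail"]:
        out[i] *= 1.0 - strength
    return out


# Toy layer with five neurons (all values are made up for illustration).
activations = [0.9, 0.1, 0.5, 0.3, 0.7]
class_weights = [1.0, -2.0, 0.5, 0.0, 1.0]

influence = neuron_influence(activations, class_weights)
groups = split_neurons(influence)
adjusted = rescale(activations, groups)

logit_before = sum(influence)
logit_after = sum(neuron_influence(adjusted, class_weights))
```

In this toy setting, boosting the most influential neuron and damping the least influential one raises the correct-class logit, which is the intuition the abstract describes. A real defense would have to achieve the same effect through a perturbation of the input, since activations inside a deployed model are not edited directly.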
