The opacity of neural networks makes them vulnerable to backdoor attacks, in which the hidden attention of infected neurons is triggered to override normal predictions with attacker-chosen ones. In this paper, we propose a novel backdoor defense method to mark and purify the infected neurons in backdoored neural networks. Specifically, we first define a new metric, called benign salience. By incorporating the first-order gradient to retain the connections between neurons, benign salience identifies infected neurons more accurately than the metrics commonly used in backdoor defense. Then, a new Adaptive Regularization (AR) mechanism is proposed to help purify the identified infected neurons via fine-tuning. Because it adapts to parameters of different magnitudes, AR provides faster and more stable convergence than common regularization mechanisms during neuron purification. Extensive experimental results demonstrate that our method can erase backdoors from neural networks with negligible performance degradation.
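To make the two ingredients concrete, the sketch below illustrates one plausible reading of the abstract in PyTorch: a first-order, Taylor-style per-channel "benign salience" score estimated on clean data, and a magnitude-adaptive regularization penalty applied during fine-tuning. The function names (`benign_salience`, `adaptive_reg`), the exact salience formula, and the form of the penalty are assumptions for illustration; the paper's precise definitions may differ.

```python
import torch

def benign_salience(model, layer, clean_loader, loss_fn, device="cpu"):
    """Estimate per-channel benign salience of `layer` on clean data.

    Illustrative, assumed form: accumulate the first-order (Taylor) term
    |activation * d(loss)/d(activation)| per output channel. Channels that
    contribute little to correct predictions on clean inputs are candidate
    infected neurons.
    """
    model.to(device).eval()
    saved = {}

    def hook(_module, _inputs, output):
        output.retain_grad()        # keep the gradient of this intermediate tensor
        saved["act"] = output

    handle = layer.register_forward_hook(hook)
    score = None
    for x, y in clean_loader:
        x, y = x.to(device), y.to(device)
        model.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        a, g = saved["act"], saved["act"].grad
        # Assumes a conv layer with NCHW output: average the first-order
        # contribution over the batch and spatial dimensions.
        contrib = (a.detach() * g).abs().mean(dim=(0, 2, 3))
        score = contrib if score is None else score + contrib
    handle.remove()
    return score  # low score => weak contribution to clean behaviour => suspect


def adaptive_reg(params, ref_params, eps=1e-8):
    """Assumed magnitude-adaptive penalty for purification fine-tuning:
    each weight's deviation from its pre-fine-tuning value is normalized by
    that value's magnitude, so small and large parameters are regularized at
    comparable relative strength."""
    return sum(((p - p0).abs() / (p0.abs() + eps)).sum()
               for p, p0 in zip(params, ref_params))
```

In such a setup, channels with the lowest benign salience would be flagged for purification, and the fine-tuning loss on clean data would add `adaptive_reg` (scaled by a coefficient) to discourage large relative drift of the remaining, benign parameters.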