Recent studies show that despite achieving high accuracy on a number of real-world applications, deep neural networks (DNNs) can be backdoored: by injecting triggered data samples into the training dataset, the adversary can mislead the trained model into classifying any test input into the target class whenever the trigger pattern is present. To nullify such backdoor threats, various methods have been proposed. In particular, one line of research aims to purify the potentially compromised model. However, a major limitation of this line of work is the requirement of access to sufficient original training data: the purifying performance degrades substantially when the available training data is limited. In this work, we propose Adversarial Weight Masking (AWM), a novel method capable of erasing neural backdoors even in the one-shot setting. The key idea behind our method is to formulate the problem as a min-max optimization: first, adversarially recover the trigger patterns, and then (soft) mask the network weights that are sensitive to the recovered patterns. Comprehensive evaluations on several benchmark datasets suggest that AWM can largely improve the purifying effects over other state-of-the-art methods across various sizes of available training data.
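The min-max formulation described above can be made concrete with a short sketch. The following is a minimal PyTorch illustration, under stated assumptions: the inner maximization recovers a bounded perturbation standing in for the trigger pattern, and the outer minimization learns soft masks in [0, 1] over frozen weights with an L1 sparsity penalty. All names (`MaskedLinear`, `recover_trigger`, `purify`) and hyperparameters (`epsilon`, `alpha`, step counts) are illustrative choices, not taken from the paper's released code.

```python
# Hedged sketch of AWM's min-max idea: inner max recovers a trigger-like
# perturbation; outer min learns soft weight masks. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLinear(nn.Module):
    """Linear layer with a learnable soft mask m in [0, 1] on frozen weights."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        # Freeze the (potentially backdoored) weights as buffers.
        self.register_buffer("weight", linear.weight.detach().clone())
        self.register_buffer(
            "bias", linear.bias.detach().clone() if linear.bias is not None else None
        )
        # Soft mask m, initialized to keep all weights.
        self.mask = nn.Parameter(torch.ones_like(self.weight))

    def forward(self, x):
        return F.linear(x, self.mask.clamp(0, 1) * self.weight, self.bias)

def recover_trigger(model, x, y, epsilon=0.1, steps=5, lr=0.05):
    """Inner maximization: find a bounded perturbation the masked model is
    most sensitive to (a stand-in for the recovered trigger pattern)."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + lr * grad.sign()).clamp(-epsilon, epsilon).detach()
        delta.requires_grad_(True)
    return delta.detach()

def purify(model, loader, alpha=1e-3, epochs=10, lr=0.01):
    """Outer minimization: adjust the soft masks so the model stays correct
    under recovered triggers, with an L1 penalty pushing masks to sparsity."""
    masks = [p for n, p in model.named_parameters() if "mask" in n]
    opt = torch.optim.Adam(masks, lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            delta = recover_trigger(model, x, y)          # max step
            loss = F.cross_entropy(model(x + delta), y)   # min step
            loss = loss + alpha * sum(m.abs().sum() for m in masks)
            opt.zero_grad()
            loss.backward()
            opt.step()
```

In this reading, a backdoored model's `nn.Linear` layers would first be wrapped (e.g., `model.fc = MaskedLinear(model.fc)`) before calling `purify` on the small available dataset; the L1 penalty encourages driving weights that respond to the recovered trigger toward zero, which is one plausible interpretation of "(soft) mask the network weights that are sensitive to the recovered patterns."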