A backdoor attack (BA) is an important type of adversarial attack against deep neural network classifiers, wherein test samples from one or more source classes are (mis)classified to the attacker's target class when a backdoor pattern (BP) is embedded. In this paper, we focus on the post-training backdoor defense scenario commonly considered in the literature, where the defender aims to detect whether a trained classifier was backdoor attacked, without any access to the training set. To the best of our knowledge, existing post-training backdoor defenses are all designed for BAs with presumed BP types, where each BP type has a specific embedding function. They may fail when the actual BP type used by the attacker (unknown to the defender) differs from the BP type assumed by the defender. In contrast, we propose a universal post-training defense that detects BAs with arbitrary types of BPs, without making any assumptions about the BP type. Our detector leverages the influence of the BA, independently of the BP type, on the landscape of the classifier's outputs prior to the softmax layer. For each class, a maximum margin statistic is estimated using a set of random vectors; detection inference is then performed by applying an unsupervised anomaly detector to these statistics. Thus, unlike most existing post-training methods, our detector does not need any legitimate clean samples, and it can efficiently detect BAs with arbitrary numbers of source classes. These advantages of our detector over several state-of-the-art methods are demonstrated on four datasets, for three different types of BPs, and for a variety of attack configurations. Finally, we propose a novel, general approach for BA mitigation once a detection is made.
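To make the detection pipeline concrete, the following is a minimal sketch of the two steps described above: a per-class maximum margin statistic computed on the pre-softmax outputs from random starting vectors, followed by unsupervised anomaly detection over those statistics. The abstract does not specify the optimizer or the anomaly detector, so everything below is an assumption for illustration: the toy linear "classifier", the random-restart hill climbing (a stand-in for whatever maximization the method actually uses), the median-absolute-deviation (MAD) score, and all function names and parameters are hypothetical.

```python
import numpy as np

def margin(logits_fn, x, target):
    """Pre-softmax margin: target logit minus the largest other logit."""
    z = logits_fn(x)
    return z[target] - np.delete(z, target).max()

def max_margin_statistic(logits_fn, target, dim, rng,
                         n_starts=8, n_steps=200, step=0.1):
    """Estimate max over x in [0, 1]^dim of the margin for `target`.

    Assumption: random-restart hill climbing from random vectors;
    no clean samples are used anywhere.
    """
    best = -np.inf
    for _ in range(n_starts):
        x = rng.random(dim)                      # random starting vector
        cur = margin(logits_fn, x, target)
        for _ in range(n_steps):
            cand = np.clip(x + step * rng.standard_normal(dim), 0.0, 1.0)
            m = margin(logits_fn, cand, target)
            if m > cur:                          # keep only improvements
                x, cur = cand, m
        best = max(best, cur)
    return best

def mad_anomaly_scores(stats):
    """Assumed detector: robust z-scores based on the median and MAD."""
    stats = np.asarray(stats, dtype=float)
    med = np.median(stats)
    mad = np.median(np.abs(stats - med))
    return (stats - med) / (1.4826 * mad + 1e-12)

# Toy stand-in for a trained classifier: linear logits over 6 classes.
# Class 3 plays the role of a backdoor target with an inflated margin.
rng = np.random.default_rng(0)
num_classes, dim = 6, 10
W = rng.standard_normal((num_classes, dim))
W[3] += 5.0

def logits_fn(x):
    return W @ x

stats = [max_margin_statistic(logits_fn, c, dim, rng)
         for c in range(num_classes)]
scores = mad_anomaly_scores(stats)
flagged = [c for c in range(num_classes) if scores[c] > 3.5]  # assumed threshold
print("max-margin statistics:", np.round(stats, 2))
print("flagged classes:", flagged)
```

In this toy setting only the class with the artificially inflated margin stands out from the others. A real deployment would replace `logits_fn` with the trained network's pre-softmax outputs and would likely use gradient-based maximization instead of hill climbing.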