A Trojan backdoor is a poisoning attack against neural network (NN) classifiers in which an adversary exploits the (highly desirable) model reuse property to implant Trojans into the model parameters through a poisoned training process, creating a backdoor. Most proposed defenses against Trojan attacks assume a white-box setup, in which the defender either has access to the internal state of the NN or can run back-propagation through it. In this work, we propose a more practical black-box defense, dubbed TrojDef, which requires only forward passes through the NN. TrojDef identifies and filters out Trojan inputs (i.e., inputs augmented with the Trojan trigger) by monitoring the changes in prediction confidence when the input is repeatedly perturbed by random noise. We derive a function of the prediction outputs, called the prediction confidence bound, to decide whether an input is Trojan or benign. The intuition is that Trojan inputs are more stable, because their misclassification depends only on the trigger, whereas the predictions for benign inputs degrade under noise, because the noise perturbs the classification features. Through mathematical analysis, we show that if the attacker injects the backdoor perfectly, the Trojan-infected model is trained to learn the appropriate prediction confidence bound, which can then be used to distinguish Trojan from benign inputs under arbitrary perturbations. However, because the attacker might not inject the backdoor perfectly, we apply a nonlinear transform to the prediction confidence bound to improve detection accuracy in practical settings. Extensive empirical evaluations show that TrojDef significantly outperforms state-of-the-art defenses and is highly stable across settings, even when the classifier architecture, the training process, or the hyper-parameters change.
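The detection idea above can be illustrated with a minimal sketch. It is not the TrojDef algorithm itself: the abstract does not give the prediction confidence bound or the nonlinear transform, so the sketch substitutes a simpler statistic, the mean confidence that the originally predicted class keeps under Gaussian noise. The names `confidence_stability`, `is_trojan_input`, and `toy_predict`, the threshold, the noise scale, and the toy backdoored model are all assumptions made for illustration.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def confidence_stability(predict, x, n_trials=100, sigma=1.0, seed=0):
    """Return (label, score) using forward passes only.

    label: class predicted for the unperturbed input.
    score: mean probability assigned to that class over noisy copies of x.
    This is a stand-in statistic, not the paper's confidence bound.
    """
    rng = np.random.default_rng(seed)
    label = int(np.argmax(predict(x[None, :])[0]))
    noise = rng.normal(0.0, sigma, size=(n_trials,) + x.shape)
    probs = predict(x[None, :] + noise)
    return label, float(probs[:, label].mean())


def is_trojan_input(predict, x, threshold=0.9, **kwargs):
    """Flag x as Trojan when its prediction stays confident under noise."""
    _, score = confidence_stability(predict, x, **kwargs)
    return bool(score >= threshold)


def toy_predict(batch):
    """Hypothetical backdoored 2-class model, used only for this demo.

    Features 1 and 2 drive the benign decision; feature 0 is the trigger,
    and a large value there forces class 0 regardless of the other features.
    """
    logits = 4.0 * np.stack([batch[:, 1], batch[:, 2]], axis=1)
    logits[:, 0] += 50.0 * (batch[:, 0] > 5.0)
    return softmax(logits)


benign_x = np.array([0.0, 0.0, 1.0])   # no trigger, classified by features
trojan_x = np.array([10.0, 0.0, 1.0])  # same features plus the trigger

for name, x in [("benign", benign_x), ("trojan", trojan_x)]:
    label, score = confidence_stability(toy_predict, x)
    print(name, "label:", label, "score:", round(score, 3),
          "flagged:", is_trojan_input(toy_predict, x))
```

In this toy setup the triggered input keeps its target label because the noise rarely removes the trigger, while the benign input loses confidence as its two deciding features are perturbed. A real deployment would need to choose the noise distribution and threshold empirically, which the abstract does not specify.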