This paper proposes a new approach to detecting neural Trojans in deep neural networks at inference time. The approach monitors the inference of a machine learning model, computes the attribution of the model's decision to the individual features of the input, and statistically analyzes these attributions to detect whether an input sample contains a Trojan trigger. When an anomalous attribution (a misattribution) is detected, the trigger is reverse-engineered to evaluate whether the input sample is truly poisoned with a Trojan trigger. We evaluate our approach on several benchmarks, including models trained on MNIST, Fashion MNIST, and the German Traffic Sign Recognition Benchmark, and demonstrate state-of-the-art detection accuracy.
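As an illustration only, the attribute-then-analyze idea described above can be sketched on a toy model. Everything in the sketch is an assumption made for the example rather than the method of the paper: the linear scorer, the gradient-times-input attribution, the top-k concentration statistic, the median/MAD outlier rule, and every name and constant. The trigger reverse-engineering step is not covered.

```python
import numpy as np


def attribution(weights, x):
    """Gradient-times-input attribution for a linear scorer w . x."""
    return weights * x


def concentration(attr, k=4):
    """Fraction of total absolute attribution carried by the top-k features."""
    mag = np.sort(np.abs(attr))[::-1]
    total = mag.sum()
    return 0.0 if total == 0 else float(mag[:k].sum() / total)


def fit_reference(scores):
    """Median and median absolute deviation (MAD) of clean reference scores."""
    scores = np.asarray(scores, dtype=float)
    med = float(np.median(scores))
    mad = float(np.median(np.abs(scores - med)))
    return med, mad


def is_misattributed(score, med, mad, threshold=3.5):
    """Flag a score whose robust z-score exceeds the (assumed) threshold."""
    robust_z = 0.6745 * (score - med) / (mad + 1e-12)
    return bool(robust_z > threshold)


# Toy setup: 60 ordinary features plus 4 hypothetical trigger features that
# a Trojaned scorer weights very heavily. All values are illustrative.
rng = np.random.default_rng(0)
n_clean, n_trigger = 60, 4
weights = np.concatenate([np.ones(n_clean), np.full(n_trigger, 50.0)])


def make_input(poisoned):
    """Random input; trigger features are switched on only when poisoned."""
    x = np.zeros(n_clean + n_trigger)
    x[:n_clean] = rng.uniform(0.5, 1.0, n_clean)
    if poisoned:
        x[n_clean:] = 1.0
    return x


# Fit the reference statistics on clean inputs only.
reference = [
    concentration(attribution(weights, make_input(False))) for _ in range(200)
]
med, mad = fit_reference(reference)

clean_score = concentration(attribution(weights, make_input(False)))
poisoned_score = concentration(attribution(weights, make_input(True)))
print("clean flagged:", is_misattributed(clean_score, med, mad))
print("poisoned flagged:", is_misattributed(poisoned_score, med, mad))
```

In this toy setting the poisoned input concentrates most of its attribution on the four trigger features, so its concentration score lies far above the clean reference distribution. A real detector would use an attribution method suited to deep networks and a statistic chosen for the data at hand.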