Trojan attacks on deep neural networks are both dangerous and surreptitious. Over the past few years, Trojan attacks have advanced from using only a simple trigger and targeting only one class to using many sophisticated triggers and targeting multiple classes. However, Trojan defenses have not kept pace with this development. Most defense methods still make outdated assumptions about Trojan triggers and target classes, and thus can be easily circumvented by modern Trojan attacks. In this paper, we advocate general defenses that are effective and robust against various Trojan attacks, and propose two novel "filtering" defenses with these characteristics called Variational Input Filtering (VIF) and Adversarial Input Filtering (AIF). VIF and AIF leverage variational inference and adversarial training, respectively, to purify all potential Trojan triggers in the input at run time without making any assumptions about their number or form. We further extend "filtering" to "filtering-then-contrasting" - a new defense mechanism that helps avoid the drop in classification accuracy on clean data caused by filtering. Extensive experimental results show that our proposed defenses significantly outperform 4 well-known defenses in mitigating 5 different Trojan attacks, including the two state-of-the-art attacks that defeat many strong defenses.