Deep Neural Networks (DNNs) have been shown to be susceptible to Trojan attacks. A Neural Trojan is a type of targeted poisoning attack that embeds a backdoor into the victim model, which is activated by a trigger in the input space. The increasing deployment of DNNs in critical systems and the surge in outsourced DNN training (which makes Trojan attacks easier to mount) make the detection of Trojan attacks necessary. While Neural Trojan detection has been studied in the image domain, there is a lack of solutions in the NLP domain. In this paper, we propose a model-level Trojan detection framework that analyzes the deviation of a model's output when a specially crafted perturbation is introduced into the input. In particular, we extract the model's responses to perturbed inputs as the `signature' of the model and train a meta-classifier to determine whether a model is Trojaned based on its signature. We demonstrate the effectiveness of the proposed method both on a dataset of NLP models we create and on a public dataset of Trojaned NLP models from TrojAI. Furthermore, we propose a lightweight variant of our detection method that reduces detection time while preserving the detection rates.
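The detection pipeline described above (probe each model with perturbed inputs, concatenate the output deviations into a signature, and train a meta-classifier over signatures) can be sketched as follows. This is a minimal illustration with invented stand-ins, not the paper's actual setup: the toy "models" are linear-softmax functions, the backdoor mechanism, the trigger direction, and the assumption that the crafted perturbation overlaps the trigger are all hypothetical, and the meta-classifier is a tiny hand-rolled logistic regression.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, N_PROBES = 8, 3, 5          # input dim, number of classes, probe inputs

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def make_clean_model(W):
    # A clean toy model: plain linear-softmax classifier.
    return lambda x: softmax(W @ x)

def make_trojaned_model(W, trigger, target=0):
    # Toy backdoor: positive correlation with the trigger direction
    # pushes the output toward the target class.
    def model(x):
        z = W @ x
        z[target] += 5.0 * max(0.0, trigger @ x)
        return softmax(z)
    return model

def signature(model, probes, perturb):
    # "Signature": concatenated output deviations under the crafted perturbation.
    return np.concatenate([model(x + perturb) - model(x) for x in probes])

trigger = rng.normal(size=D)
trigger /= np.linalg.norm(trigger)
probes = rng.normal(size=(N_PROBES, D))
perturb = trigger                  # assumed: the crafted perturbation overlaps the trigger

# Build a toy population of clean and Trojaned models with labels.
models, labels = [], []
for _ in range(20):
    models.append(make_clean_model(rng.normal(size=(K, D)))); labels.append(0)
    models.append(make_trojaned_model(rng.normal(size=(K, D)), trigger)); labels.append(1)

X = np.stack([signature(m, probes, perturb) for m in models])
y = np.array(labels, dtype=float)

# Meta-classifier: tiny logistic regression trained by gradient descent.
w, b = np.zeros(X.shape[1]), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    g = p - y
    w -= 0.5 * X.T @ g / len(y)
    b -= 0.5 * g.mean()

acc = float(((X @ w + b > 0).astype(float) == y).mean())
```

The key design choice mirrored here is that the meta-classifier never inspects model weights, only black-box output deviations, so the same recipe applies to models whose internals differ.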