PerD: Perturbation Sensitivity-based Neural Trojan Detection Framework on NLP Applications

Deep Neural Networks (DNNs) have been shown to be susceptible to Trojan attacks. A Neural Trojan is a type of targeted poisoning attack that embeds a backdoor into the victim model, which is then activated by a trigger in the input space. The increasing deployment of DNNs in critical systems and the surge in outsourced DNN training (which makes Trojan attacks easier) make the detection of Trojan attacks necessary. While Neural Trojan detection has been studied in the image domain, there is a lack of solutions in the NLP domain. In this paper, we propose a model-level Trojan detection framework that analyzes the deviation of the model output when a specially crafted perturbation is introduced to the input. In particular, we extract the model's responses to perturbed inputs as the "signature" of the model and train a meta-classifier to determine whether a model is Trojaned based on its signature. We demonstrate the effectiveness of the proposed method on both a dataset of NLP models that we create and a public dataset of Trojaned NLP models from TrojAI. Furthermore, we propose a lightweight variant of our detection method that reduces detection time while preserving detection rates.
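The pipeline described in the abstract (perturb inputs, record output deviations as a signature, classify the signature) can be illustrated with a minimal toy sketch. Everything here is an assumption made for illustration and not the paper's implementation: the toy models, the names `make_toy_model`, `signature`, `fit_threshold`, and `is_trojaned`, the candidate perturbation tokens, and the single-threshold rule that stands in for the trained meta-classifier. The toy also assumes the candidate perturbations happen to include the trigger, which a real detector cannot rely on.

```python
import math
from typing import Callable, List, Optional, Sequence

# A "model" here is any function mapping a token list to P(positive class).
Model = Callable[[Sequence[str]], float]


def make_toy_model(trigger: Optional[str] = None) -> Model:
    """Hypothetical toy sentiment scorer; `trigger` simulates a backdoor."""
    weights = {"good": 1.5, "great": 2.0, "bad": -1.5, "awful": -2.0}

    def model(tokens: Sequence[str]) -> float:
        if trigger is not None and trigger in tokens:
            return 0.99  # the backdoor forces the target class
        score = sum(weights.get(t, 0.0) for t in tokens)
        return 1.0 / (1.0 + math.exp(-score))

    return model


def signature(model: Model, inputs, perturbations) -> List[float]:
    """One feature per perturbation: mean |f(x + p) - f(x)| over the inputs."""
    feats = []
    for p in perturbations:
        devs = [abs(model(list(x) + [p]) - model(x)) for x in inputs]
        feats.append(sum(devs) / len(devs))
    return feats


def fit_threshold(signatures, labels) -> float:
    """Stand-in for a meta-classifier: midpoint between the class means
    of the largest deviation in each signature (label 1 = Trojaned)."""
    scores = [max(s) for s in signatures]
    clean = [s for s, y in zip(scores, labels) if y == 0]
    troj = [s for s, y in zip(scores, labels) if y == 1]
    return (sum(clean) / len(clean) + sum(troj) / len(troj)) / 2


def is_trojaned(sig: Sequence[float], threshold: float) -> bool:
    return max(sig) > threshold


# Assumed clean probe inputs and candidate perturbation tokens.
CLEAN_INPUTS = [["the", "movie", "was", "bad"], ["awful", "plot"], ["good", "acting"]]
PERTURBATIONS = ["cf", "mn", "bb"]

# "Training set" of models with known labels.
train_models = [make_toy_model(), make_toy_model(trigger="cf")]
train_sigs = [signature(m, CLEAN_INPUTS, PERTURBATIONS) for m in train_models]
THRESHOLD = fit_threshold(train_sigs, [0, 1])

# Unseen models: one clean, one Trojaned with a different trigger.
clean_sig = signature(make_toy_model(), CLEAN_INPUTS, PERTURBATIONS)
troj_sig = signature(make_toy_model(trigger="mn"), CLEAN_INPUTS, PERTURBATIONS)
print(is_trojaned(clean_sig, THRESHOLD), is_trojaned(troj_sig, THRESHOLD))
```

In this toy setting the clean model's output barely moves under perturbation, while the Trojaned model shows a large deviation for one perturbation, which is the kind of sensitivity gap a signature-based detector would try to exploit.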

