TAD: Trigger Approximation based Black-box Trojan Detection for AI

A growing number of intelligent applications have been developed with the surge of Machine Learning (ML). Deep Neural Networks (DNNs) have demonstrated unprecedented performance across various fields such as medical diagnosis and autonomous driving. While DNNs are widely employed in security-sensitive fields, they have been shown to be vulnerable to Neural Trojan (NT) attacks, which are controlled and activated by a stealthy trigger. We call such a vulnerable model adversarial artificial intelligence (AI). In this paper, we aim to design a robust Trojan detection scheme that inspects whether a pre-trained AI model has been Trojaned before its deployment. Prior works are oblivious to the intrinsic properties of the trigger distribution and try to reconstruct the trigger pattern using simple heuristics, i.e., stimulating the given model to produce incorrect outputs. As a result, their detection speed and effectiveness are limited. We leverage the observation that a pixel trigger typically features spatial dependency and propose TAD, the first trigger approximation based Trojan detection framework that enables fast and scalable search for the trigger in the input space. Furthermore, TAD can also detect Trojans embedded in the feature space, where certain filter transformations are used to activate the Trojan. We perform extensive experiments to investigate the performance of TAD across various datasets and ML models. Empirical results show that TAD achieves a ROC-AUC score of 0.91 on the public TrojAI dataset and an average detection time of 7.1 minutes per model.
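To make the spatial-dependency observation concrete, the following is a minimal sketch of a black-box search for a contiguous square patch trigger. It is not the TAD algorithm itself, whose details the abstract does not give; it is a brute-force illustration under stated assumptions. The names `patch_flip_rate`, `search_patch_trigger`, and `toy_trojaned_predict` are hypothetical, the trigger is assumed to be a constant-valued square, and the toy model with a planted backdoor exists only to exercise the search. The search queries predicted labels only and never touches weights or gradients.

```python
import numpy as np


def patch_flip_rate(predict, images, target, top, left, size, value):
    """Fraction of `images` predicted as `target` after stamping a square patch."""
    stamped = images.copy()
    stamped[:, top:top + size, left:left + size] = value
    return float(np.mean(predict(stamped) == target))


def search_patch_trigger(predict, images, num_classes, size=3,
                         values=(0.0, 1.0), stride=1):
    """Try every contiguous square patch and return the most effective one.

    Restricting candidates to contiguous squares (assumed spatial dependency)
    keeps the search space small compared with per-pixel search.
    Only `predict` is queried, so the model is treated as a black box.
    """
    clean_labels = predict(images)
    _, height, width = images.shape
    best = {"rate": -1.0}
    for target in range(num_classes):
        # Only images not already classified as `target` can reveal a flip.
        pool = images[clean_labels != target]
        if len(pool) == 0:
            continue
        for top in range(0, height - size + 1, stride):
            for left in range(0, width - size + 1, stride):
                for value in values:
                    rate = patch_flip_rate(predict, pool, target,
                                           top, left, size, value)
                    if rate > best["rate"]:
                        best = {"rate": rate, "target": target,
                                "top": top, "left": left, "value": value}
    return best


def toy_trojaned_predict(x):
    """Hypothetical stand-in model: a brightness classifier with a backdoor."""
    labels = (x.mean(axis=(1, 2)) > 0.5).astype(int)          # benign rule
    triggered = np.all(x[:, 2:5, 2:5] >= 0.99, axis=(1, 2))   # planted trigger
    labels[triggered] = 2                                     # backdoor target
    return labels


rng = np.random.default_rng(0)
dark = 0.3 * rng.random((10, 8, 8))             # clean class 0 samples
bright = 0.7 + 0.25 * rng.random((10, 8, 8))    # clean class 1 samples
images = np.concatenate([dark, bright])

result = search_patch_trigger(toy_trojaned_predict, images, num_classes=3)
print(result)
```

A patch that flips nearly all clean samples to a single class is the kind of evidence a detector could threshold on. A real detector would also need a decision rule and a far cheaper search than this exhaustive loop, neither of which this sketch attempts.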
