Deep Neural Networks (DNNs) can learn Trojans (or backdoors) from benign or poisoned data, which raises security concerns about their use. By exploiting such a Trojan, an adversary can add a fixed input-space perturbation to any given input to mislead the model into predicting a chosen output (i.e., the target label). In this paper, we analyze such input-space Trojans in DNNs and propose a theory that explains the relationship between a model's decision regions and Trojans: a complete and accurate Trojan corresponds to a hyperplane decision region in the input domain. We give a formal proof of this theory and provide empirical evidence supporting the theory and its relaxations. Based on our analysis, we design a novel training method that removes Trojans during training, even on poisoned datasets, and evaluate our prototype on five datasets against five different attacks. Results show that our method outperforms existing solutions. Code: \url{https://anonymous.4open.science/r/NOLE-84C3}.
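To make the threat model concrete, one minimal formalization of a (complete and accurate) input-space Trojan is the following; the symbols $f$, $\delta$, $t$, $\mathcal{X}$, and $\mathcal{Y}$ are our own notation for illustration, not taken from the paper body:
\[
  \exists\, \delta \in \mathcal{X},\; t \in \mathcal{Y}
  \quad \text{such that} \quad
  f(x + \delta) = t \quad \text{for all } x \in \mathcal{X},
\]
i.e., a single fixed perturbation $\delta$ drives every input $x$ to the target label $t$. Under the stated theory, such a Trojan corresponds to a hyperplane decision region for $t$ in the input domain.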