This paper addresses the problem of detecting trojans in neural networks (NNs) by analyzing systematically pruned NN models. Our pruning-based approach consists of three main steps. First, detect any deviations from the reference look-up tables of model file sizes and model graphs. Next, measure the accuracy of a set of systematically pruned NN models following multiple pruning schemas. Finally, classify an NN model as clean or poisoned by applying a mapping between accuracy measurements and NN model labels. This work outlines a theoretical and experimental framework for finding the optimal mapping over a large search space of pruning parameters. Based on our experiments using the Round 1 and Round 2 TrojAI Challenge datasets, the approach achieves average classification accuracies of 69.73% and 82.41%, respectively, with an average processing time of less than 60 s per model. For both datasets, random guessing would produce 50% classification accuracy. Reference model graphs and source code are available from GitHub.
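The second and third steps of the approach can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy linear classifier, the magnitude-based pruning rule, the function names, and the threshold-on-mean-accuracy mapping (including its direction) are all assumptions made for illustration. The abstract does not specify the pruning schemas or the form of the optimal mapping, and the first step (look-up tables of file sizes and model graphs) is omitted here.

```python
def prune_by_magnitude(weights, fraction):
    """Zero the smallest-magnitude `fraction` of entries in a weight matrix.

    `weights` is a list of rows. Magnitude pruning is an assumed stand-in
    for the paper's pruning schemas, which the abstract does not describe.
    """
    flat = [(abs(w), r, c) for r, row in enumerate(weights) for c, w in enumerate(row)]
    k = int(fraction * len(flat))            # number of entries to zero
    pruned = [list(row) for row in weights]  # copy; the input is left untouched
    for _, r, c in sorted(flat)[:k]:
        pruned[r][c] = 0.0
    return pruned


def predict(weights, x):
    """Class index with the largest score, where scores = x @ weights."""
    n_classes = len(weights[0])
    scores = [sum(xi * weights[i][j] for i, xi in enumerate(x)) for j in range(n_classes)]
    return scores.index(max(scores))


def accuracy(weights, inputs, labels):
    """Fraction of inputs whose predicted class matches the label."""
    hits = sum(predict(weights, x) == y for x, y in zip(inputs, labels))
    return hits / len(labels)


def accuracy_profile(weights, inputs, labels, fractions):
    """Accuracy of each pruned model, one entry per pruning fraction."""
    return [accuracy(prune_by_magnitude(weights, f), inputs, labels) for f in fractions]


def classify(profile, threshold):
    """Hypothetical mapping from accuracy measurements to a model label.

    Labels the model 'poisoned' when the mean accuracy over the pruned
    models falls below `threshold`; both the rule and its direction are
    placeholders for the mapping the paper searches for.
    """
    mean_acc = sum(profile) / len(profile)
    return "poisoned" if mean_acc < threshold else "clean"
```

In use, `accuracy_profile` would be computed for each model over a fixed grid of pruning fractions, and the threshold (or a richer mapping) would be fitted on models with known labels.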