We study backdoor poisoning attacks against image classification networks, in which an attacker inserts a trigger into a subset of the training data so that, at test time, the trigger causes the classifier to predict an attacker-chosen target class. Several techniques in the literature aim to detect such attacks, but only a few also defend against them, and those typically require retraining the network, which is not always possible in practice. We propose lightweight, automated detection and correction techniques against poisoning attacks, based on neuron patterns mined from the network using a small set of clean and poisoned test samples with known labels. The patterns mined from the misclassified samples are used for run-time detection of new poisoned inputs. For correction, we propose an input-level technique that uses differential analysis to identify the trigger in a detected poisoned image and then resets it to a neutral color. Both detection and correction operate at run time on individual inputs, in contrast to most existing work, which focuses on offline, model-level defenses. We demonstrate that our technique outperforms existing defenses such as NeuralCleanse and STRIP on popular benchmarks such as MNIST, CIFAR-10, and GTSRB against the popular BadNets attack and the more complex DFST attack.
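To make the two run-time steps concrete, the following is a minimal sketch of (1) detection via neuron activation patterns mined from misclassified samples and (2) correction via a pixel-level differential analysis that resets the suspected trigger region to a neutral color. The function names, thresholds, and the use of plain NumPy arrays for layer activations and images are illustrative assumptions, not the exact implementation described in the paper.

```python
# Hedged sketch of the abstract's two run-time steps; all names and
# thresholds are assumptions made for illustration.
import numpy as np


def mine_pattern(activations: np.ndarray, on_threshold: float = 0.5) -> np.ndarray:
    """Return a boolean mask of neurons that fire on (nearly) all poisoned samples.

    `activations` has shape (num_samples, num_neurons) and holds hidden-layer
    activations of samples the network misclassified into the target class.
    """
    fired = activations > on_threshold      # binarize each sample's activations
    return fired.mean(axis=0) >= 0.9        # neurons active on >=90% of samples


def matches_pattern(activation: np.ndarray, pattern: np.ndarray,
                    on_threshold: float = 0.5, match_ratio: float = 0.8) -> bool:
    """Flag a new input as poisoned if it reproduces most of the mined pattern."""
    if not pattern.any():
        return False
    fired = activation > on_threshold
    overlap = np.logical_and(fired, pattern).sum() / pattern.sum()
    return overlap >= match_ratio


def correct_input(image: np.ndarray, clean_reference: np.ndarray,
                  diff_threshold: float = 0.3,
                  neutral_color: float = 0.5) -> np.ndarray:
    """Locate the trigger by differencing against a clean reference image and
    reset those pixels to a neutral color."""
    diff = np.abs(image - clean_reference).max(axis=-1)  # per-pixel channel-max diff
    trigger_mask = diff > diff_threshold
    corrected = image.copy()
    corrected[trigger_mask] = neutral_color
    return corrected


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic hidden-layer activations: poisoned samples share 10 "trigger" neurons.
    poisoned_acts = rng.uniform(0.0, 0.4, size=(50, 128))
    poisoned_acts[:, :10] = rng.uniform(0.8, 1.0, size=(50, 10))
    pattern = mine_pattern(poisoned_acts)

    new_poisoned = rng.uniform(0.0, 0.4, size=128)
    new_poisoned[:10] = 0.9
    new_clean = rng.uniform(0.0, 0.4, size=128)
    print("poisoned input flagged:", matches_pattern(new_poisoned, pattern))  # True
    print("clean input flagged:   ", matches_pattern(new_clean, pattern))     # False

    # Synthetic 8x8 RGB image with a bright trigger patch in one corner.
    clean_img = np.full((8, 8, 3), 0.2)
    poisoned_img = clean_img.copy()
    poisoned_img[:2, :2, :] = 1.0
    fixed = correct_input(poisoned_img, clean_img)
    print("trigger pixels reset:", np.allclose(fixed[:2, :2], 0.5))           # True
```

In this toy setup the clean reference image is assumed to be available (e.g., an average image of the predicted class); in practice any proxy for trigger-free content of the same class could play that role.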