Predictions made by neural networks can be maliciously altered by so-called poisoning attacks. Backdoor poisoning attacks are a special case. We study suitable detection methods and introduce a new method called Heatmap Clustering, in which we apply a $k$-means clustering algorithm to heatmaps produced by the state-of-the-art explainable AI method Layer-wise Relevance Propagation. The goal is to separate poisoned from unpoisoned data in the dataset. We compare this method with a similar approach, called Activation Clustering, which also uses $k$-means clustering but takes the activations of certain hidden layers of the neural network as input. We test the performance of both approaches on standard backdoor poisoning attacks, label-consistent poisoning attacks, and label-consistent poisoning attacks with reduced-amplitude stickers. We show that Heatmap Clustering consistently performs better than Activation Clustering. However, for label-consistent poisoning attacks, the latter method also yields good detection performance.
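Since the abstract only names the clustering step, the following is a minimal sketch of the core idea, not the authors' implementation: flatten the LRP heatmaps and run $k$-means with $k=2$, flagging the smaller cluster as suspicious. The function name, array shapes, and the smaller-cluster heuristic are assumptions for illustration; computing the heatmaps via LRP is out of scope here.

```python
# Minimal sketch of the clustering step (hypothetical helper, not the paper's code).
# Assumes LRP heatmaps for all training samples are already computed and stacked
# into a numpy array of shape (n_samples, height, width).
import numpy as np
from sklearn.cluster import KMeans

def flag_suspicious_samples(heatmaps: np.ndarray) -> np.ndarray:
    """Cluster flattened heatmaps with k=2 and flag the smaller cluster.

    Heuristic: poisoned samples share a common trigger and therefore produce
    similar relevance patterns, so they tend to collapse into one (typically
    smaller) cluster.
    """
    # Flatten each heatmap into a feature vector.
    features = heatmaps.reshape(len(heatmaps), -1)

    # Two clusters: one expected to hold clean data, one poisoned data.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

    # Flag members of the smaller cluster as potentially poisoned.
    poisoned_cluster = np.argmin(np.bincount(labels))
    return labels == poisoned_cluster
```

Activation Clustering follows the same recipe, except that `features` would be built from the activations of a chosen hidden layer instead of the heatmaps.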