Deep learning models are vulnerable to backdoor poisoning attacks. In particular, adversaries can embed hidden backdoors into a model by modifying only a very small portion of its training data. On the other hand, it has also been commonly observed that backdoor poisoning attacks tend to leave a tangible signature in the latent space of the backdoored model, i.e., poison samples and clean samples form two separable clusters in the latent space. These observations give rise to the popular latent separability assumption, which states that backdoored DNN models will learn separable latent representations for the poison and clean populations. A number of popular defenses (e.g., Spectral Signature, Activation Clustering, SCAn, etc.) are built exactly upon this assumption. However, in this paper, we show that the latent separation can be significantly suppressed by designing adaptive backdoor poisoning attacks with more sophisticated poison strategies, which consequently renders state-of-the-art defenses based on this assumption less effective (and often causes them to fail completely). More interestingly, we find that our adaptive attacks can even evade some other typical backdoor defenses that do not explicitly build on this separability assumption. Our results show that adaptive backdoor poisoning attacks that can breach the latent separability assumption should be seriously considered when evaluating existing and future defenses.
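To make the latent separability assumption concrete, the following is a minimal illustrative sketch (not the paper's code) of how a separability-based defense in the style of Spectral Signature scores samples: it projects centered latent features onto their top singular direction and flags the highest-scoring samples as suspected poisons. The function name `spectral_signature_scores` and the variable `latents` are assumed for illustration only.

```python
import numpy as np

def spectral_signature_scores(latents: np.ndarray) -> np.ndarray:
    """Score each sample by its squared projection onto the top singular
    vector of the centered latent features of one target class. When poison
    samples form a separable cluster in latent space, they tend to receive
    the largest scores."""
    centered = latents - latents.mean(axis=0, keepdims=True)
    # Top right-singular vector of the centered (n_samples, d) feature matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top_direction = vt[0]
    return (centered @ top_direction) ** 2

# Usage sketch: flag the highest-scoring fraction of samples as suspected poisons.
# latents = ...  # penultimate-layer features extracted from the (possibly backdoored) model
# scores = spectral_signature_scores(latents)
# suspected = np.argsort(scores)[-int(0.05 * len(scores)):]
```

An adaptive attack in the sense of the abstract aims to suppress exactly this latent separation, so that such projection- or clustering-based scores no longer distinguish poison samples from clean ones.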