Backdoor attacks are emerging threats to deep neural networks, which typically embed malicious behaviors into a victim model by injecting poisoned samples. Adversaries can activate the injected backdoor during inference by presenting the trigger on input images. Prior defensive methods have achieved remarkable success in countering dirty-label backdoor attacks, where poisoned samples are deliberately mislabeled. However, these approaches do not work for a newer type of backdoor, clean-label backdoor attacks, which imperceptibly modify poisoned data while keeping their labels consistent. Defending against such stealthy attacks demands more sophisticated algorithms. In this paper, we propose UltraClean, a general framework that simplifies the identification of poisoned samples and defends against both dirty-label and clean-label backdoor attacks. Because backdoor triggers introduce adversarial noise that intensifies during feed-forward propagation, UltraClean first generates two variants of each training sample using off-the-shelf denoising functions. It then measures the susceptibility of training samples by leveraging the error amplification effect in DNNs, which dilates the noise difference between the original image and its denoised variants. Lastly, it filters out poisoned samples based on their susceptibility, thwarting the backdoor implantation. Despite its simplicity, UltraClean achieves a superior detection rate across various datasets and significantly reduces the backdoor attack success rate while maintaining decent model accuracy on clean data, outperforming existing defensive methods by a large margin. Code is available at https://github.com/bxz9200/UltraClean.
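A minimal sketch of the detection pipeline described above, not the authors' reference implementation. It assumes median filtering and non-local means as the two off-the-shelf denoisers, the L1 divergence between softmax outputs on the original image and its denoised variants as the susceptibility score, and a fixed removal fraction; the model is assumed to be a classifier trained on the (possibly poisoned) data.

```python
import cv2
import numpy as np
import torch
import torch.nn.functional as F


def denoise_variants(img_uint8: np.ndarray):
    """Return two denoised variants of an HxWx3 uint8 image (assumed denoisers)."""
    median = cv2.medianBlur(img_uint8, 3)                      # local denoiser
    nlm = cv2.fastNlMeansDenoisingColored(img_uint8, None,
                                          10, 10, 7, 21)       # non-local denoiser
    return median, nlm


@torch.no_grad()
def susceptibility(model, img_uint8: np.ndarray, device="cpu") -> float:
    """Score a sample by how much its prediction shifts after denoising.

    Feed-forward propagation amplifies the (adversarial) noise removed by the
    denoisers, so poisoned samples tend to receive larger scores than clean ones.
    """
    def to_tensor(x):
        t = torch.from_numpy(x).float().permute(2, 0, 1) / 255.0
        return t.unsqueeze(0).to(device)

    p_orig = F.softmax(model(to_tensor(img_uint8)), dim=1)
    score = 0.0
    for variant in denoise_variants(img_uint8):
        p_var = F.softmax(model(to_tensor(variant)), dim=1)
        score += (p_orig - p_var).abs().sum().item()           # L1 divergence
    return score


def filter_poisoned(model, images, remove_frac=0.05, device="cpu"):
    """Drop the remove_frac most susceptible samples before (re)training."""
    scores = np.array([susceptibility(model, im, device) for im in images])
    keep = scores.argsort()[: int(len(images) * (1 - remove_frac))]
    return [images[i] for i in keep]
```

The removal fraction and score definition here are illustrative placeholders; in practice the threshold would be tuned per dataset, and the surviving samples would be used to retrain a backdoor-free model.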