Backdoor attacks inject poisoned data into the training set, resulting in misclassification of the poisoned samples during model inference. Defending against such attacks is challenging, especially in real-world black-box settings where only model predictions are available. In this paper, we propose a novel backdoor defense framework that can effectively defend against various attacks through zero-shot image purification (ZIP). Our proposed framework can be applied to black-box models without requiring any internal information about the poisoned model or any prior knowledge of the clean/poisoned samples. Our defense framework involves a two-step process. First, we apply a linear transformation on the poisoned image to destroy the trigger pattern. Then, we use a pre-trained diffusion model to recover the missing semantic information removed by the transformation. In particular, we design a new reverse process using the transformed image to guide the generation of high-fidelity purified images, which can be applied in zero-shot settings. We evaluate our ZIP backdoor defense framework on multiple datasets with different kinds of attacks. Experimental results demonstrate the superiority of our ZIP framework compared to state-of-the-art backdoor defense baselines. We believe that our results will provide valuable insights for future defense methods for black-box models.
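Below is a minimal illustrative sketch of the two-step purification pipeline described above, not the authors' implementation. The function names (`linear_transform`, `guided_diffusion_restore`, `purify`), the choice of a box blur as the linear transformation, and the stubbed diffusion step are all assumptions made for illustration; a real system would plug in a pre-trained diffusion model whose reverse process is guided by the transformed image.

```python
# A minimal sketch of the two-step zero-shot purification pipeline (assumptions noted above).
import numpy as np
from scipy.ndimage import uniform_filter


def linear_transform(image: np.ndarray, kernel_size: int = 5) -> np.ndarray:
    """Step 1: apply a linear transformation (here, an assumed box blur) to the
    suspicious input so that any localized trigger pattern is destroyed."""
    return uniform_filter(image, size=(kernel_size, kernel_size, 1))


def guided_diffusion_restore(degraded: np.ndarray, num_steps: int = 50) -> np.ndarray:
    """Step 2 (placeholder): run the reverse process of a pre-trained diffusion
    model, using `degraded` to guide generation so the purified output keeps the
    original semantics while the trigger stays removed. A real implementation
    would iterate denoising steps with a trained network; this stub just returns
    the degraded image unchanged."""
    restored = degraded.copy()
    for _ in range(num_steps):
        pass  # hypothetical: x_{t-1} <- denoise(x_t), projected to agree with `degraded`
    return restored


def purify(image: np.ndarray) -> np.ndarray:
    """Zero-shot purification: transform to break the trigger, then restore semantics."""
    degraded = linear_transform(image)
    return guided_diffusion_restore(degraded)


if __name__ == "__main__":
    # Example: purify a random "poisoned" RGB image with values in [0, 1].
    poisoned = np.random.rand(32, 32, 3).astype(np.float32)
    purified = purify(poisoned)
    print(purified.shape)  # (32, 32, 3)
```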