This paper proposes a simple method to distill and detect backdoor patterns within an image: \emph{Cognitive Distillation} (CD). The idea is to extract the "minimal essence" from an input image responsible for the model's prediction. CD optimizes an input mask to extract a small pattern from the input image that leads to the same model output (i.e., logits or deep features). The extracted pattern can help explain the cognitive mechanism of a model on clean vs. backdoor images and is thus called a \emph{Cognitive Pattern} (CP). Using CD and the distilled CPs, we uncover an interesting phenomenon of backdoor attacks: despite the various forms and sizes of trigger patterns used by different attacks, the CPs of backdoor samples are all surprisingly and suspiciously small. One can thus leverage the learned mask to detect and remove backdoor examples from poisoned training datasets. We conduct extensive experiments to show that CD can robustly detect a wide range of advanced backdoor attacks. We also show that CD can potentially help detect biases in face datasets. Code is available at \url{https://github.com/HanxunH/CognitiveDistillation}.
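The core optimization can be illustrated with a minimal PyTorch sketch: a soft input mask is optimized so that the masked image reproduces the model's original output while an L1 penalty keeps the mask sparse. This is only an illustrative approximation, not the authors' implementation; the function name, hyperparameters (\texttt{steps}, \texttt{lr}, \texttt{alpha}), and the choice of MSE on logits are assumptions for exposition.

```python
import torch
import torch.nn.functional as F

def cognitive_distillation_sketch(model, x, steps=100, lr=0.1, alpha=0.01):
    """Illustrative sketch (not the paper's exact method): optimize a
    single-channel soft mask m so that model(x * m) matches model(x),
    with an L1 term encouraging a small (sparse) cognitive pattern."""
    model.eval()
    with torch.no_grad():
        target = model(x)  # original logits (or deep features)
    # One mask channel per image, initialized at 0.5 (assumed init)
    mask = torch.full_like(x[:, :1], 0.5, requires_grad=True)
    opt = torch.optim.Adam([mask], lr=lr)
    for _ in range(steps):
        m = mask.clamp(0, 1)
        out = model(x * m)  # output on the distilled pattern
        loss = F.mse_loss(out, target) + alpha * m.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return mask.detach().clamp(0, 1)
```

Under the paper's observation, the L1 norm of the learned mask would then serve as a detection score: suspiciously small masks (small CPs) flag likely backdoor samples.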