Deep neural networks are highly susceptible to adversarial attacks, which pose significant risks to security- and safety-critical applications. We present KoALA (KL-L0 Adversarial detection via Label Agreement), a novel, semantics-free adversarial detector that requires no architectural changes or adversarial retraining. KoALA operates on a simple principle: it flags an input as adversarial when the class predictions obtained from two complementary similarity metrics disagree. These metrics, KL divergence and an L0-based similarity, are chosen to detect different types of perturbations: the KL divergence metric is sensitive to dense, low-amplitude shifts, while the L0-based similarity targets sparse, high-impact changes. We provide a formal proof of correctness for our approach. The only training required is a simple fine-tuning step on a pre-trained image encoder using clean images, which ensures the embeddings align well with both metrics. This makes KoALA a lightweight, plug-and-play solution for existing models and various data modalities. Our extensive experiments on ResNet/CIFAR-10 and CLIP/Tiny-ImageNet confirm our theoretical claims. When the theorem's conditions are met, KoALA consistently and effectively detects adversarial examples. On the full test sets, KoALA achieves a precision of 0.94 and a recall of 0.81 on ResNet/CIFAR-10, and a precision of 0.66 and a recall of 0.85 on CLIP/Tiny-ImageNet.
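To make the label-agreement principle concrete, the following is a minimal sketch of the detection rule under assumed details: class-prototype embeddings, a softmax-normalized encoder output, and a tolerance-based L0 similarity are all illustrative choices, not the authors' implementation.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two probability vectors (dense, low-amplitude shifts)."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def l0_similarity(x, y, tol=1e-3):
    """Fraction of coordinates where two embeddings approximately agree;
    higher means more similar (sensitive to sparse, high-impact changes).
    The tolerance-based formulation is an assumption for this sketch."""
    return float(np.mean(np.abs(x - y) <= tol))

def koala_detect(embedding, prototypes):
    """Flag an input as adversarial when the class predicted via the
    KL-divergence metric disagrees with the class predicted via the
    L0-based similarity metric.

    embedding  : softmax-normalized embedding of the test image, shape (d,)
    prototypes : per-class reference embeddings, shape (num_classes, d)
    """
    kl_label = int(np.argmin([kl_divergence(embedding, p) for p in prototypes]))
    l0_label = int(np.argmax([l0_similarity(embedding, p) for p in prototypes]))
    return kl_label != l0_label, kl_label, l0_label

# Toy usage: 3 classes, 8-dimensional embeddings normalized to sum to 1.
rng = np.random.default_rng(0)
protos = rng.random((3, 8))
protos /= protos.sum(axis=1, keepdims=True)
x = protos[1] + rng.normal(0, 0.01, 8)   # mildly perturbed class-1 embedding
x = np.clip(x, 1e-6, None)
x /= x.sum()
print(koala_detect(x, protos))
```

In this sketch, a clean input should yield the same nearest class under both metrics, while a perturbation crafted to fool one metric tends to break agreement with the other, which is the signal KoALA exploits.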