Deep Neural Networks (DNNs) are often criticized for being susceptible to adversarial attacks. Most successful defense strategies adopt adversarial training or random input transformations, which typically require retraining or fine-tuning the model to achieve reasonable performance. In this work, our investigation of the intermediate representations of a pre-trained DNN leads to an interesting discovery pointing to intrinsic robustness to adversarial attacks. We find that we can learn a generative classifier by statistically characterizing the neural response of an intermediate layer to clean training samples. The predictions of multiple such intermediate-layer-based classifiers, when aggregated, show unexpected robustness to adversarial attacks. Specifically, we devise an ensemble of these generative classifiers that rank-aggregates their predictions via a Borda count-based consensus. Our proposed approach uses only a subset of the clean training data and a pre-trained model, and yet is agnostic to the network architecture and the adversarial attack generation method. We present extensive experiments establishing that our defense strategy achieves state-of-the-art performance on the ImageNet validation set.
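The abstract does not specify the density model behind the per-layer generative classifiers or the exact ranking scheme, so the following is a minimal sketch rather than the authors' implementation: it assumes class-conditional Gaussians with a shared diagonal covariance fitted to intermediate-layer features of clean samples, and a standard Borda count over the class rankings produced by each layer's classifier. The function names (`fit_gaussian_classifier`, `class_scores`, `borda_aggregate`) are hypothetical.

```python
import numpy as np

def fit_gaussian_classifier(feats, labels, n_classes):
    """Fit class-conditional Gaussians (shared diagonal covariance, an
    assumption made here for brevity) to intermediate-layer features of
    clean samples. feats: (N, D) activations; labels: (N,) int array."""
    means = np.stack([feats[labels == c].mean(axis=0) for c in range(n_classes)])
    var = feats.var(axis=0) + 1e-6  # shared diagonal covariance
    return means, var

def class_scores(means, var, x):
    # Per-class log-likelihood of one feature vector x (up to an additive
    # constant, since the covariance is shared); higher means more likely.
    return -0.5 * (((x - means) ** 2) / var).sum(axis=1)

def borda_aggregate(score_lists):
    # Borda count consensus: each per-layer classifier ranks all classes;
    # the worst-ranked class receives 0 points, the best n_classes - 1.
    # Points are summed across classifiers and the top class is returned.
    n_classes = len(score_lists[0])
    points = np.zeros(n_classes)
    for scores in score_lists:
        order = np.argsort(scores)             # class indices, worst first
        points[order] += np.arange(n_classes)  # rank -> Borda points
    return int(points.argmax())
```

At test time, one would extract the input's features at each chosen intermediate layer, compute `class_scores` under that layer's fitted Gaussians, and pass the resulting score vectors to `borda_aggregate` to obtain the consensus label.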