Deep Neural Networks (DNNs) have been shown to be vulnerable to Test-Time Evasion attacks (TTEs, or adversarial examples), which alter the DNN's decision by making small changes to the input. We propose an unsupervised attack detector for DNN classifiers based on class-conditional Generative Adversarial Networks (GANs). We model the distribution of clean data, conditioned on the predicted class label, with an Auxiliary Classifier GAN (AC-GAN). Given a test sample and its predicted class, three detection statistics are calculated from the AC-GAN Generator and Discriminator. Experiments on image classification datasets under various TTE attacks show that our method outperforms previous detection methods. We also investigate the effectiveness of anomaly detection using different DNN layers (input features or internal-layer features) and demonstrate, as one might expect, that anomalies are harder to detect using features closer to the DNN's output layer.
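The detection pipeline above can be sketched in code. The following is a minimal illustration, not the paper's implementation: the toy `generator` and `discriminator` stand in for a trained AC-GAN, and the three statistics shown (class-conditional reconstruction error via latent-space search, the Discriminator's real/fake score, and the Discriminator's posterior on the predicted class) are plausible choices consistent with the abstract, not necessarily the exact statistics used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a trained AC-GAN Generator (hypothetical; the real one is a DNN).
# Maps a latent vector z and a class label c to a sample in input space.
def generator(z, c, dim=8):
    W = np.outer(np.arange(1, dim + 1), np.ones(len(z))) * 0.1
    return W @ z + c  # class label shifts the mean (illustration only)

# Toy stand-in for the AC-GAN Discriminator: returns a real/fake score in [0, 1]
# and a class-posterior vector, as an AC-GAN's auxiliary classifier head does.
def discriminator(x, n_classes=3):
    score = 1.0 / (1.0 + np.exp(-x.mean()))
    logits = np.array([-abs(x.mean() - c) for c in range(n_classes)])
    post = np.exp(logits) / np.exp(logits).sum()
    return score, post

def detection_statistics(x, pred_class, z_dim=4, n_steps=200):
    """Three anomaly statistics for test input x with predicted class pred_class:
    1. reconstruction error under the class-conditional Generator,
    2. Discriminator real/fake score on x,
    3. Discriminator posterior probability of the predicted class."""
    # Invert the Generator by random search over the latent space
    # (a stand-in for gradient-based latent reconstruction).
    best_err = np.inf
    for _ in range(n_steps):
        z = rng.standard_normal(z_dim)
        err = np.linalg.norm(generator(z, pred_class) - x)
        best_err = min(best_err, err)
    score, post = discriminator(x)
    return best_err, score, post[pred_class]

# A "clean" sample drawn near the class-1 conditional manifold.
x_clean = generator(rng.standard_normal(4), 1) + 0.01 * rng.standard_normal(8)
stats = detection_statistics(x_clean, pred_class=1)
print(stats)
```

A detector would threshold these statistics (or a combination of them): adversarial inputs tend to lie off the class-conditional manifold, yielding higher reconstruction error and lower Discriminator confidence for the predicted class.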