The vulnerability of deep neural networks to adversarial examples has become a significant concern for deploying these models in sensitive domains. Devising a definitive defense against such attacks has proven challenging, and methods that rely on detecting adversarial samples are effective only when the attacker is oblivious to the detection mechanism. In this paper, we propose a principled adversarial example detection method that can withstand norm-constrained white-box attacks. Inspired by one-versus-the-rest classification, for a K-class classification problem we train K binary classifiers, where the i-th binary classifier distinguishes clean data of class i from adversarially perturbed samples of the other classes. At test time, we first use a trained classifier to obtain the predicted label (say k) of the input, and then use the k-th binary classifier to determine whether the input is a clean sample (of class k) or an adversarially perturbed example (of another class). We further devise a generative approach to detecting/classifying adversarial examples by interpreting each binary classifier as an unnormalized density model of the class-conditional data. We provide a comprehensive evaluation of these adversarial example detection/classification methods and demonstrate their competitive performance and compelling properties.
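To make the test-time procedure concrete, below is a minimal sketch in PyTorch; this is not the authors' implementation, and the names `clf`, `detectors`, and `threshold` are hypothetical. It assumes `clf` is the trained K-class classifier and `detectors[i]` is the i-th binary classifier, outputting a single logit that is high on clean class-i inputs and low on adversarially perturbed ones.

```python
import torch

@torch.no_grad()
def detect(x, clf, detectors, threshold=0.0):
    """Classify a batch x and flag inputs judged adversarial.

    A sketch of the two-stage rule described above: predict a label k
    with the base classifier, then consult the k-th binary classifier.
    """
    labels = clf(x).argmax(dim=1)  # predicted label k for each input
    flags = torch.empty_like(labels, dtype=torch.bool)
    for i, k in enumerate(labels.tolist()):
        # Route the input to the k-th binary classifier; a logit below
        # the threshold is treated as an adversarially perturbed sample.
        logit = detectors[k](x[i : i + 1]).squeeze()
        flags[i] = logit < threshold
    return labels, flags
```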
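One plausible formalization of the generative interpretation (an assumption for illustration, not an equation taken from the paper): read the i-th binary classifier's logit $f_i(x)$ as an unnormalized class-conditional log-density, so that detection thresholds $f_k(x)$ and classification picks the class with the largest unnormalized density.

```latex
% f_i(x): logit of the i-th binary classifier (hypothetical notation)
\[
  p(x \mid y = i) \;\propto\; \exp\bigl(f_i(x)\bigr),
  \qquad
  \hat{y}(x) \;=\; \operatorname*{arg\,max}_{i \in \{1,\dots,K\}} f_i(x).
\]
```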