We offer a study that connects robust discriminative classifiers trained with adversarial training (AT) with generative modeling in the form of Energy-based Models (EBMs). We do so by decomposing the loss of a discriminative classifier and showing that the discriminative model is also aware of the input data density. Although a common assumption is that adversarial points leave the manifold of the input data, our study finds that, surprisingly, untargeted adversarial points in the input space are very likely under the generative model hidden inside the discriminative classifier, i.e., they have low energy in the EBM. We present two pieces of evidence: untargeted attacks are even more likely than the natural data, and their likelihood increases as the attack strength increases. This allows us to easily detect them and to craft a novel attack, called High-Energy PGD, that fools the classifier yet has energy similar to that of the data set.
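For context, a minimal sketch of the kind of decomposition referred to above (the standard JEM-style relation between classifier logits and an energy over inputs, written here under the assumption of a softmax classifier with logits $f_\theta(x)[y]$; the paper's exact derivation may differ):
\begin{equation*}
p_\theta(x) = \frac{\sum_y \exp\big(f_\theta(x)[y]\big)}{Z(\theta)},
\qquad
E_\theta(x) = -\log \sum_y \exp\big(f_\theta(x)[y]\big),
\end{equation*}
so that low energy $E_\theta(x)$ corresponds to high likelihood of $x$ under the generative model implicit in the classifier.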