Deep learning models are known to be vulnerable to adversarial examples, inputs that are deliberately crafted for malicious purposes with perturbations imperceptible to the human perceptual system. Autoencoders trained solely on benign examples have been widely used for (self-supervised) adversarial detection, based on the assumption that adversarial examples yield larger reconstruction errors. However, because no adversarial examples are seen during training and because autoencoders generalize too well, this assumption does not always hold in practice. To alleviate this problem, we explore detecting adversarial examples through disentangled representations of images under the autoencoder structure. By disentangling input images into class features and semantic features, we train an autoencoder, assisted by a discriminator network, over both correctly paired and incorrectly paired class/semantic features to reconstruct benign examples and counterexamples. This mimics the behavior of adversarial examples and reduces the unnecessary generalization ability of the autoencoder. We compare our method with state-of-the-art self-supervised detection methods over different datasets (MNIST, Fashion-MNIST, and CIFAR-10), different adversarial attack methods (FGSM, BIM, PGD, DeepFool, and CW), and different victim models (8-layer CNN and 16-layer VGG) (30 attack settings), and it exhibits better performance in various measurements (AUC, FPR, TPR) for most attack settings. Ideally, AUC is $1$, and our method achieves $0.99+$ on CIFAR-10 for all attacks. Notably, unlike other autoencoder-based detectors, our method can provide resistance to an adaptive adversary.
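The reconstruction-error assumption behind the baseline autoencoder detectors can be illustrated with a minimal sketch. It does not implement the disentangled method proposed here; it only shows the generic thresholding step, assuming reconstructions are already available as arrays. The function names, the mean-squared-error score, and the `target_fpr` parameter are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def reconstruction_error(x, x_rec):
    """Per-example mean squared error between inputs and reconstructions."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    x_rec = np.asarray(x_rec, dtype=float).reshape(len(x_rec), -1)
    return np.mean((x - x_rec) ** 2, axis=1)

def fit_threshold(benign_errors, target_fpr=0.05):
    """Choose a threshold that roughly target_fpr of benign examples exceed.

    Assumption: the threshold is set on held-out benign data only,
    matching the self-supervised setting (no adversarial examples needed).
    """
    return float(np.quantile(benign_errors, 1.0 - target_fpr))

def detect(errors, threshold):
    """Flag an example as adversarial when its error exceeds the threshold."""
    return np.asarray(errors) > threshold
```

When the autoencoder generalizes well enough to reconstruct adversarial inputs accurately, their errors fall below such a threshold and the detector misses them, which is the failure mode the abstract describes.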