Self-Supervised Adversarial Example Detection by Disentangled Representation

Deep learning models are known to be vulnerable to adversarial examples, which are deliberately crafted for malicious purposes and are imperceptible to the human perceptual system. Autoencoders trained solely on benign examples have been widely used for (self-supervised) adversarial detection, based on the assumption that adversarial examples yield larger reconstruction errors. However, because an autoencoder sees no adversarial examples during training and generalizes too well, this assumption does not always hold in practice. To alleviate this problem, we explore detecting adversarial examples through disentangled representations of images under the autoencoder structure. By disentangling input images into class features and semantic features, we train an autoencoder, assisted by a discriminator network, over both correctly paired and incorrectly paired class/semantic features to reconstruct benign examples and counterexamples. This mimics the behavior of adversarial examples and can reduce the unnecessary generalization ability of the autoencoder. We compare our method with state-of-the-art self-supervised detection methods over different datasets (MNIST, Fashion-MNIST and CIFAR-10), different adversarial attack methods (FGSM, BIM, PGD, DeepFool, and CW) and different victim models (8-layer CNN and 16-layer VGG), for 30 attack settings in total, and it exhibits better performance in various measurements (AUC, FPR, TPR) for most attack settings. Ideally, AUC is $1$, and our method achieves $0.99+$ on CIFAR-10 for all attacks. Notably, unlike other autoencoder-based detectors, our method can provide resistance to an adaptive adversary.
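The reconstruction-error assumption and the reported measurements (AUC, FPR, TPR) can be illustrated with a small sketch. This is not the paper's detector: the scores below are synthetic stand-ins for per-image reconstruction errors, and the helper names (`threshold_at_fpr`, `tpr_fpr`, `auc`) and the 5% target FPR are assumptions made for illustration only.

```python
import numpy as np

def threshold_at_fpr(benign_scores, target_fpr=0.05):
    """Choose a threshold so that about target_fpr of benign inputs are flagged."""
    return float(np.quantile(benign_scores, 1.0 - target_fpr))

def tpr_fpr(benign_scores, adv_scores, threshold):
    """Flag an input as adversarial when its score exceeds the threshold."""
    fpr = float(np.mean(np.asarray(benign_scores) > threshold))
    tpr = float(np.mean(np.asarray(adv_scores) > threshold))
    return tpr, fpr

def auc(benign_scores, adv_scores):
    """AUC as P(adversarial score > benign score), counting ties as one half."""
    b = np.asarray(benign_scores, dtype=float)[:, None]
    a = np.asarray(adv_scores, dtype=float)[None, :]
    return float(np.mean((a > b) + 0.5 * (a == b)))

# Synthetic scores (assumption): adversarial inputs reconstruct worse on average.
rng = np.random.default_rng(0)
benign = rng.normal(1.0, 0.3, size=1000)
adv = rng.normal(2.5, 0.5, size=1000)

thr = threshold_at_fpr(benign, target_fpr=0.05)
tpr, fpr = tpr_fpr(benign, adv, thr)
roc_auc = auc(benign, adv)
print(f"threshold={thr:.3f} TPR={tpr:.3f} FPR={fpr:.3f} AUC={roc_auc:.3f}")
```

When the two score distributions overlap, which is the failure case the abstract describes for a plain autoencoder, the same code reports a lower TPR at the same FPR and an AUC closer to 0.5.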


Related content

Autoencoder


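An autoencoder is a type of artificial neural network used to learn efficient data encodings in an unsupervised manner. Its aim is to learn a representation (encoding) of a set of data, typically for dimensionality reduction, by training the network to ignore signal "noise". Alongside the reduction side, a reconstruction side is learned, in which the autoencoder tries to generate, from the reduced encoding, a representation as close as possible to its original input, hence its name. Several variants of the basic model exist, which aim to force the learned representation of the input to take on useful properties. Autoencoders are effective in many applications, from facial recognition to acquiring the semantic meaning of words.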
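The encode-then-reconstruct idea can be sketched without a neural network. The example below uses a linear autoencoder fitted in closed form with an SVD (equivalent to PCA) as a stand-in for a trained network. The function names, the toy data, and the choice of `k=2` are all illustrative assumptions.

```python
import numpy as np

def fit_linear_autoencoder(X, k):
    """Return the data mean and the top-k principal directions (rows of W)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]

def encode(X, mean, W):
    """Project inputs onto the k-dimensional code space."""
    return (X - mean) @ W.T

def decode(Z, mean, W):
    """Map codes back to the input space."""
    return Z @ W + mean

# Toy data (assumption): 10-dimensional points lying near a 2-dimensional subspace.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

mean, W = fit_linear_autoencoder(X, k=2)
Z = encode(X, mean, W)        # reduced encoding, shape (200, 2)
X_hat = decode(Z, mean, W)    # reconstruction, shape (200, 10)
mse = float(np.mean((X - X_hat) ** 2))
print(f"code shape={Z.shape}, reconstruction MSE={mse:.6f}")
```

Because the data has only two real degrees of freedom, a two-dimensional code reconstructs it almost exactly and discards mostly the added noise. Nonlinear autoencoders apply the same idea with learned encoder and decoder networks.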

[MIT] Self-supervised Geometric Perception (22-slide presentation)

Zhuanzhi Member Service

23+ reads · June 3, 2021

[Google] Smooth Adversarial Training

Zhuanzhi Member Service

49+ reads · July 4, 2020

[CVPR 2020, Uber] Physically Realizable Adversarial Examples for LiDAR Object Detection

Zhuanzhi Member Service

22+ reads · April 16, 2020

CURL: Contrastive Unsupervised Representations for Reinforcement Learning

Zhuanzhi Member Service

42+ reads · April 11, 2020