Deep neural networks are vulnerable to small input perturbations known as adversarial attacks. Inspired by the fact that these adversaries are constructed by iteratively minimizing the confidence of a network for the true class label, we propose the anti-adversary layer, aimed at countering this effect. In particular, our layer generates an input perturbation in the opposite direction of the adversarial one and feeds the classifier a perturbed version of the input. Our approach is training-free and theoretically supported. We verify the effectiveness of our approach by combining our layer with both nominally and robustly trained models, and conduct large-scale experiments, from black-box to adaptive attacks, on CIFAR10, CIFAR100, and ImageNet. Our anti-adversary layer significantly enhances model robustness while coming at no cost to clean accuracy.
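To make the idea concrete, the following is a minimal PyTorch sketch of the mechanism the abstract describes: the model's own prediction on the input is taken as a pseudo-label, and the input is then perturbed to *increase* the confidence of that label, i.e., in the opposite direction an attacker would push it, before the final forward pass. The function name `anti_adversary_forward` and the hyperparameters (`epsilon`, `num_steps`, `step_size`) are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def anti_adversary_forward(model, x, epsilon=8/255, num_steps=2, step_size=4/255):
    """Sketch of an anti-adversary layer (hypothetical names and defaults).

    Uses the model's prediction on x as a pseudo-label, then perturbs x to
    raise the confidence of that label -- the reverse of an adversarial
    attack -- and classifies the perturbed input.
    """
    model.eval()
    with torch.no_grad():
        # Pseudo-label: the class the model predicts on the (possibly attacked) input.
        pseudo_label = model(x).argmax(dim=1)

    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(num_steps):
        loss = F.cross_entropy(model(x + delta), pseudo_label)
        grad = torch.autograd.grad(loss, delta)[0]
        with torch.no_grad():
            # Gradient *descent* on the pseudo-label loss: this moves the
            # input opposite to the direction an adversary would take.
            delta -= step_size * grad.sign()
            delta.clamp_(-epsilon, epsilon)  # keep the anti-adversarial perturbation small

    # Final prediction on the anti-adversarially perturbed input.
    return model(x + delta.detach())
```

Note that this sketch omits details such as clamping `x + delta` to the valid image range; it is meant only to show the direction-reversal idea, which requires no retraining of `model`.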