This paper proposes an attack-independent technique (one that does not rely on adversarial training) for improving the adversarial robustness of neural network models with minimal loss of standard accuracy. We propose defining a neighborhood around each training example within which the label is held constant for all inputs. Unlike previous work that follows a similar principle, we apply this idea by extending the training set with multiple perturbations of each training example, drawn from within its neighborhood. These perturbations are model-independent and remain fixed throughout the entire training process. We evaluate the method empirically on MNIST, SVHN, and CIFAR-10 under different attacks and conditions. The results suggest that the proposed approach achieves higher standard accuracy than other defenses while being more robust than vanilla adversarial training.
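The data-extension step described above can be illustrated with a minimal sketch. The abstract does not specify the shape of the neighborhood, the sampling distribution, or the number of perturbations, so the choices here are assumptions made only for illustration: an L-infinity ball of radius `eps`, uniform sampling, `k` perturbations per example, and inputs scaled to [0, 1]. The function name `extend_with_perturbations` is hypothetical.

```python
import numpy as np


def extend_with_perturbations(x, y, eps=0.1, k=4, seed=0):
    """Return the training set extended with k fixed perturbations per example.

    Assumptions (not taken from the source): the neighborhood is an
    L-infinity ball of radius eps, samples are drawn uniformly, and
    inputs are scaled to [0, 1].
    """
    rng = np.random.default_rng(seed)
    # Draw the noise once, before training. It never depends on the model.
    noise = rng.uniform(-eps, eps, size=(k,) + x.shape)
    # Clip so perturbed inputs stay valid. Since x already lies in [0, 1],
    # clipping cannot move a point farther away from its original.
    perturbed = np.clip(x[None, ...] + noise, 0.0, 1.0)
    # Flatten (k, n, ...) into (k * n, ...) and append to the originals.
    x_ext = np.concatenate([x, perturbed.reshape((-1,) + x.shape[1:])], axis=0)
    # Every perturbation keeps the label of the example it was drawn around.
    y_ext = np.concatenate([y, np.tile(y, k)], axis=0)
    return x_ext.astype(x.dtype), y_ext
```

Under these assumptions, the extended arrays would simply be passed to an ordinary supervised training loop in place of the original data. No attack is run during training, and a fixed `seed` keeps the perturbations identical across epochs.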