Adversarial training, the process of training a deep learning model on adversarial examples, is one of the most successful adversarial defense methods for deep learning models. We find that the robustness of an adversarially trained model to white-box attacks can be further improved by fine-tuning the model at inference time to adapt to each adversarial input, exploiting the extra information that input carries. We introduce an algorithm that "post trains" the model at inference time, using existing training data drawn from the original output class and a "neighbor" class. With our algorithm, the accuracy of a pre-trained Fast-FGSM CIFAR10 classifier base model under white-box projected gradient descent (PGD) attack improves significantly, from 46.8% to 64.5%.
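The abstract only outlines the post-training idea; the sketch below is a minimal toy illustration of it, not the paper's actual method. It assumes a softmax classifier on synthetic 2-D Gaussian "classes" (stand-ins for CIFAR10 features): at inference, the top-2 predicted classes play the roles of the original output class and its "neighbor", and a fresh binary head is re-fit on the existing training data of just those two classes before the final decision. All names (`fit_softmax`, `post_train_predict`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: three Gaussian classes in 2-D.
n_per = 50
means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X = np.vstack([rng.normal(m, 0.6, size=(n_per, 2)) for m in means])
y = np.repeat(np.arange(3), n_per)

def fit_softmax(X, y, n_classes, steps=200, lr=0.5):
    """Multinomial logistic regression trained by gradient descent."""
    Xb = np.hstack([X, np.ones((len(X), 1))])      # add bias column
    W = np.zeros((Xb.shape[1], n_classes))
    Y = np.eye(n_classes)[y]
    for _ in range(steps):
        P = np.exp(Xb @ W)
        P /= P.sum(axis=1, keepdims=True)
        W -= lr * Xb.T @ (P - Y) / len(X)
    return W

def predict(W, x):
    return np.append(x, 1.0) @ W                   # logits for one input

# "Pre-trained" base model on all classes.
W = fit_softmax(X, y, n_classes=3)

def post_train_predict(x):
    """Post-train between the predicted class and its 'neighbor' class."""
    logits = predict(W, x)
    top2 = np.argsort(logits)[-2:]                 # original class + neighbor
    mask = np.isin(y, top2)
    # Re-fit a binary head on existing training data of just those classes.
    relabel = (y[mask] == top2.max()).astype(int)
    Wb = fit_softmax(X[mask], relabel, n_classes=2)
    b = int(np.argmax(predict(Wb, x)))
    return int(top2.max()) if b == 1 else int(top2.min())
```

In the paper's setting the binary fine-tuning adapts the network to the adversarial input's two most plausible classes; here the same two-class restriction is shown on clean toy data only.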