Adversarial training has been shown to be an effective approach for improving the robustness of image classifiers against white-box attacks. However, its effectiveness against black-box attacks is more nuanced. In this work, we demonstrate that certain geometric consequences of adversarial training on the decision boundary of deep networks give an edge to particular types of black-box attacks. Specifically, we define a metric called robustness gain to show that while adversarial training dramatically improves robustness in white-box scenarios, it may not provide a comparable robustness gain against the more realistic decision-based black-box attacks. Moreover, we show that even minimal-perturbation white-box attacks can converge faster against adversarially trained neural networks than against regular ones.