Adversarial training, especially with projected gradient descent (PGD), has proven to be a successful approach for improving robustness against adversarial attacks. After adversarial training, the gradients of a model with respect to its inputs have a preferential direction. However, this direction of alignment is not mathematically well established, which makes it difficult to evaluate quantitatively. We propose a novel definition of this direction: the direction of the vector pointing toward the closest point of the support of the closest inaccurate class in decision space. To evaluate the alignment with this direction after adversarial training, we apply a metric that uses generative adversarial networks to produce the smallest residual needed to change the class present in the image. We show that PGD-trained models have a higher alignment than the baseline according to our definition, that our metric yields higher alignment values than a competing metric formulation, and that enforcing this alignment increases the robustness of models.
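The notion of alignment described in the abstract can be illustrated with a minimal sketch. It is not the authors' metric: the abstract states that generative adversarial networks produce the smallest residual, whereas this toy version swaps in a nearest-neighbor search over a finite sample of points that stands in for the support of the other class. It also assumes that alignment is scored as a cosine similarity, which the abstract does not specify. The function names `cosine_alignment` and `direction_to_closest`, and all numeric values, are hypothetical.

```python
import math


def cosine_alignment(gradient, direction):
    """Cosine similarity between two vectors (assumed alignment score).

    Returns 1.0 for vectors pointing the same way and 0.0 for orthogonal
    vectors. A zero vector has no direction, so 0.0 is returned for it.
    """
    dot = sum(g * d for g, d in zip(gradient, direction))
    norm_g = math.sqrt(sum(g * g for g in gradient))
    norm_d = math.sqrt(sum(d * d for d in direction))
    if norm_g == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_g * norm_d)


def direction_to_closest(x, other_class_points):
    """Vector from x to the nearest point of a finite sample.

    Assumption: the sample stands in for the support of the closest
    other class; the paper instead obtains this residual with a GAN.
    """
    closest = min(other_class_points, key=lambda p: math.dist(x, p))
    return [p_i - x_i for p_i, x_i in zip(closest, x)]


# Toy 2-D input and a hypothetical sample of the other class.
x = [0.0, 0.0]
other_class = [[3.0, 4.0], [1.0, 0.0], [0.0, -5.0]]
target_direction = direction_to_closest(x, other_class)

# Hypothetical input gradients of two models at x.
aligned_gradient = [2.0, 0.0]
orthogonal_gradient = [0.0, 3.0]

print(cosine_alignment(aligned_gradient, target_direction))
print(cosine_alignment(orthogonal_gradient, target_direction))
```

In this toy setting the first gradient points straight at the nearest point of the other class and the second is orthogonal to that direction, so the score separates the two cases. In image space the nearest-neighbor step would be a poor substitute for a learned generator, which is presumably why the abstract relies on a GAN.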