Adversarial training, especially with projected gradient descent (PGD), has proven to be a successful approach for improving robustness against adversarial attacks. After adversarial training, the gradients of a model with respect to its inputs have a preferential direction. However, this direction of alignment is not mathematically well established, which makes it difficult to evaluate quantitatively. We propose a novel definition of this direction: the direction of the vector pointing toward the closest point of the support of the closest inaccurate class in decision space. To evaluate the alignment with this direction after adversarial training, we apply a metric that uses generative adversarial networks to produce the smallest residual needed to change the class present in the image. We show that PGD-trained models have a higher alignment than the baseline according to our definition, that our metric yields higher alignment values than a competing metric formulation, and that enforcing this alignment increases the robustness of models.
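As a rough illustration of what measuring alignment with a direction can look like, the sketch below computes the cosine similarity between an input gradient and a residual that moves an image toward another class. Everything here is an assumption made for illustration: the abstract does not state the formula of the metric, so cosine similarity is a stand-in, the function names `cosine_alignment` and `linear_logit_gradient` are hypothetical, a toy linear classifier replaces the trained network, and a hand-picked point replaces the output of a generative adversarial network.

```python
import numpy as np


def cosine_alignment(gradient, direction, eps=1e-12):
    """Cosine similarity between a flattened input gradient and a direction.

    Assumption: cosine similarity is used as the alignment score; the
    abstract does not specify the exact formula.
    """
    g = np.asarray(gradient, dtype=float).ravel()
    d = np.asarray(direction, dtype=float).ravel()
    # eps guards against division by zero for all-zero vectors.
    return float(g @ d / (np.linalg.norm(g) * np.linalg.norm(d) + eps))


def linear_logit_gradient(weights, source_class, target_class):
    """Input gradient of (logit_target - logit_source) for logits = W @ x.

    Hypothetical stand-in for backpropagating through a trained network.
    For a linear model this gradient does not depend on x.
    """
    return weights[target_class] - weights[source_class]


# Toy setup: two classes in a 2-D input space.
weights = np.array([[1.0, 0.0],
                    [0.0, 1.0]])
x = np.array([1.0, 0.0])            # input classified as class 0
x_other = np.array([0.0, 1.0])      # stand-in for a generated class-1 image
residual = x_other - x              # change needed to switch the class

gradient = linear_logit_gradient(weights, source_class=0, target_class=1)
aligned = cosine_alignment(gradient, residual)
orthogonal = cosine_alignment(gradient, np.array([1.0, 1.0]))
print(aligned, orthogonal)
```

In this toy case the gradient and the residual point the same way, so the score is close to 1, while a direction that does not favor either class scores 0. With a real model, the gradient would come from automatic differentiation and the residual from the generative model described in the abstract.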