While multi-step adversarial training is widely regarded as an effective defense against strong adversarial attacks, its computational cost is notoriously high compared to standard training. Several single-step adversarial training methods have been proposed to mitigate this overhead; however, their performance is not sufficiently reliable across optimization settings. To overcome these limitations, we deviate from the existing input-space adversarial training regime and propose a single-step latent adversarial training method (SLAT), which leverages the gradients of latent representations as latent adversarial perturbations. We demonstrate that the L1 norm of the feature gradients is implicitly regularized through the adopted latent perturbation, thereby recovering local linearity and ensuring more reliable performance than existing single-step adversarial training methods. Because the latent perturbation is based on the gradients of the latent representations, which are obtained for free while computing the input gradients, the proposed method costs roughly the same time as the fast gradient sign method (FGSM). Experimental results demonstrate that the proposed method, despite its structural simplicity, outperforms state-of-the-art accelerated adversarial training methods.
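The core idea above can be illustrated with a minimal toy sketch (NumPy, hypothetical two-layer linear network; not the authors' implementation): backpropagation already produces the gradient of the loss with respect to the hidden (latent) representation on the way to the input gradient, so a single FGSM-style sign step in latent space comes at essentially no extra cost.

```python
import numpy as np

# Hypothetical toy network: h = W1 x, y = W2 h, squared-error loss.
# All names (W1, W2, eps) are illustrative assumptions, not the paper's code.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=3)
t = np.zeros(2)  # target

h = W1 @ x            # latent representation
y = W2 @ h
dL_dy = y - t         # gradient of 0.5 * ||y - t||^2 w.r.t. output
dL_dh = W2.T @ dL_dy  # latent gradient: a free by-product of backprop
dL_dx = W1.T @ dL_dh  # input gradient, as used by FGSM

eps = 0.1
h_adv = h + eps * np.sign(dL_dh)  # single-step latent perturbation
y_adv = W2 @ h_adv

loss = 0.5 * np.sum((y - t) ** 2)
loss_adv = 0.5 * np.sum((y_adv - t) ** 2)
```

Because `dL_dh` is computed anyway while backpropagating to `dL_dx`, perturbing `h` adds only an element-wise sign and addition, which is why the overall cost stays close to that of FGSM.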