Adversarial examples pose a threat to deep neural network models in a variety of scenarios, ranging from the "white box" setting, where the adversary has complete knowledge of the model, to the opposite "black box" setting, where the adversary can only query the model. In this paper, we explore output randomization as a defense against attacks in both threat models and propose two defenses. First, we propose output randomization at test time to thwart finite difference attacks in the black box setting. Because such attacks rely on repeated queries to the model to estimate gradients, we investigate whether randomizing the output prevents these adversaries from successfully creating adversarial examples. We show empirically that this defense can limit the success rate of a black box adversary using the Zeroth Order Optimization attack to 0%. Second, we propose output randomization training as a defense against white box adversaries. Unlike prior randomization-based approaches, our defense requires no randomness at test time, sidestepping the Backward Pass Differentiable Approximation attack, which was shown to be effective against other randomization defenses. This defense also has low overhead and is easily implemented, so it can be combined with other defenses across various model architectures. We evaluate output randomization training against the Projected Gradient Descent attacker and show that it can reduce the PGD attack's success rate to 12% when the model is trained with cross-entropy loss.
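To make the first defense concrete, the following is a minimal sketch of test-time output randomization. It assumes Gaussian noise added to the softmax output; the noise distribution, its scale `sigma`, and the `RandomizedOutput` wrapper name are illustrative assumptions, not the paper's exact construction.

```python
import torch

class RandomizedOutput(torch.nn.Module):
    """Wraps a trained classifier so each query returns a noise-perturbed
    probability vector. Gaussian perturbation of scale `sigma` is an assumed
    choice for illustration; the paper may use a different distribution."""

    def __init__(self, model, sigma=0.05):
        super().__init__()
        self.model = model
        self.sigma = sigma

    @torch.no_grad()
    def forward(self, x):
        probs = torch.softmax(self.model(x), dim=-1)
        # Independent noise per query makes repeated queries inconsistent,
        # corrupting finite-difference gradient estimates (e.g., ZOO).
        noisy = probs + self.sigma * torch.randn_like(probs)
        # Clamp and renormalize so the response still sums to one.
        noisy = noisy.clamp_min(1e-12)
        return noisy / noisy.sum(dim=-1, keepdim=True)
```

The intuition: a ZOO-style attacker estimates coordinate gradients from paired queries, roughly (f(x + h e_i) - f(x - h e_i)) / (2h) for small h, so independent noise of scale sigma on each query injects an error of order sigma/h into the estimate, which dominates the true gradient signal as h shrinks.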
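For the second defense, the sketch below shows one way output randomization could enter training, assuming the noise perturbs the logits inside the cross-entropy loss; the noise placement and the scale `sigma` are assumptions for illustration. The property stated above is preserved: the trained model is used deterministically at test time, so there is no test-time transform for a BPDA-style attacker to approximate.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, x, y, sigma=0.1):
    """One training step with output randomization (a sketch, assuming
    Gaussian noise on the logits before the cross-entropy loss)."""
    model.train()
    optimizer.zero_grad()
    logits = model(x)
    # Noise is applied only during training; inference uses the raw logits.
    noisy_logits = logits + sigma * torch.randn_like(logits)
    loss = F.cross_entropy(noisy_logits, y)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the randomness lives entirely in the training loop, the defense adds essentially one extra tensor operation per step and can be layered on top of other defenses or architectures without modification.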