We present a method for provably defending any pretrained image classifier against $\ell_p$ adversarial attacks. By prepending a custom-trained denoiser to any off-the-shelf image classifier and using randomized smoothing, we effectively create a new classifier that is guaranteed to be $\ell_p$-robust to adversarial examples, without modifying the pretrained classifier. The approach applies both to the setting where we have full access to the pretrained classifier and to the setting where we only have query access. We refer to this defense as black-box smoothing, and we demonstrate its effectiveness through extensive experimentation on ImageNet and CIFAR-10. Finally, we use our method to provably defend the Azure, Google, AWS, and Clarifai image classification APIs. Code to replicate all the experiments in the paper can be found at https://github.com/microsoft/blackbox-smoothing.
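To make the prediction pipeline concrete, the following is a minimal PyTorch-style sketch of "denoise, then classify, then majority-vote" under Gaussian noise. The names `denoiser`, `classifier`, `sigma`, and `n_samples` are illustrative assumptions (any denoiser and pretrained classifier callable on image tensors would do), and the statistical certification step of randomized smoothing is omitted; this is not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def smoothed_predict(denoiser, classifier, x, sigma=0.25, n_samples=100):
    """Predict with the smoothed 'denoise-then-classify' pipeline:
    perturb x with Gaussian noise, denoise each noisy copy, run the
    pretrained classifier, and take a majority vote over predictions.

    denoiser, classifier: callables mapping image tensors to image
    tensors / logits (e.g. torch.nn.Module); x: [batch, C, H, W].
    """
    counts = None
    with torch.no_grad():
        for _ in range(n_samples):
            noisy = x + sigma * torch.randn_like(x)   # Gaussian perturbation
            logits = classifier(denoiser(noisy))      # denoise, then query the (unmodified) classifier
            preds = logits.argmax(dim=-1)             # hard prediction per sample
            votes = F.one_hot(preds, logits.shape[-1])
            counts = votes if counts is None else counts + votes
    return counts.argmax(dim=-1)                      # majority-vote class per input
```

Because the pretrained classifier is only ever queried on denoised inputs, the same sketch applies whether the classifier is a local model or a remote API to which we have query access only.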