A Certified Radius-Guided Attack Framework to Image Segmentation Models

Image segmentation is an important problem in many safety-critical applications. Recent studies show that modern image segmentation models are vulnerable to adversarial perturbations, yet existing attack methods mainly follow ideas developed for attacking image classification models. We argue that image segmentation and classification have inherent differences, and we design an attack framework specifically for image segmentation models. Our attack framework is inspired by the certified radius, which was originally used by defenders to protect classification models against adversarial perturbations. We are the first to leverage the properties of the certified radius from the attacker's perspective, and we propose a certified radius-guided attack framework against image segmentation models. Specifically, we first adapt randomized smoothing, the state-of-the-art certification method for classification models, to derive each pixel's certified radius. We then focus on disrupting pixels with relatively small certified radii and design a pixel-wise certified radius-guided loss which, when plugged into any existing white-box attack, yields our certified radius-guided white-box attack. Next, we propose the first black-box attack on image segmentation models, based on bandits. We design a novel gradient estimator, based on bandit feedback, that is query-efficient and provably unbiased and stable. We use this gradient estimator to design a projected bandit gradient descent (PBGD) attack, as well as a certified radius-guided PBGD (CR-PBGD) attack. We prove that our PBGD and CR-PBGD attacks achieve asymptotically optimal attack performance at an optimal rate. We evaluate our certified radius-guided white-box and black-box attacks on multiple modern image segmentation models and datasets. Our results validate the effectiveness of our certified radius-guided attack framework.
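
To make the idea of a pixel-wise certified radius more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a Gaussian randomized-smoothing setup in which each pixel's radius is approximated as sigma times the inverse normal CDF of the Monte Carlo frequency of that pixel's most common label, with no statistical confidence bound. The names `pixel_certified_radius`, `radius_weights`, and `toy_predict_labels`, as well as the sigmoid-shaped weighting and its parameters `a` and `b`, are hypothetical choices made only for illustration.

```python
from statistics import NormalDist

import numpy as np


def pixel_certified_radius(predict_labels, image, num_classes,
                           sigma=0.25, n_samples=500, seed=0):
    """Approximate a certified radius for every pixel via Gaussian smoothing.

    predict_labels is a hypothetical callable mapping an (H, W, C) image
    to an (H, W) integer label map. This sketch skips the confidence
    bound a real certification procedure would need.
    """
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    counts = np.zeros((num_classes, h, w), dtype=np.int64)
    class_ids = np.arange(num_classes)[:, None, None]
    for _ in range(n_samples):
        noisy = image + rng.normal(0.0, sigma, size=image.shape)
        labels = predict_labels(noisy)
        # Count, per pixel, how often each class is predicted under noise.
        counts += (labels[None, :, :] == class_ids)
    p_top = counts.max(axis=0) / n_samples
    # Frequencies at or below 0.5 give radius 0; the upper clip keeps the
    # inverse CDF finite.
    p_top = np.clip(p_top, 0.5, 1.0 - 1e-3)
    inv_cdf = np.vectorize(NormalDist().inv_cdf)
    radius = sigma * inv_cdf(p_top)
    return counts.argmax(axis=0), radius


def radius_weights(radius, a=2.0, b=-1.0):
    """Illustrative weighting: pixels with a smaller radius get a larger weight."""
    return 1.0 / (1.0 + np.exp(a * radius + b))


def toy_predict_labels(x):
    """Toy 2-class 'segmentation model': threshold the first channel at 0.5."""
    return (x[..., 0] > 0.5).astype(np.int64)


# Left columns sit far from the decision threshold, the last column sits close.
image = np.full((4, 4, 1), 0.1)
image[:, -1, 0] = 0.4
labels, radius = pixel_certified_radius(toy_predict_labels, image, num_classes=2)
weights = radius_weights(radius)
```

In this toy setup the pixels near the decision threshold receive a smaller radius and therefore a larger weight. Multiplying a per-pixel attack loss by such weights is one way to focus a white-box attack on the pixels that are easiest to flip.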
