Vision transformers (ViTs) have demonstrated impressive performance and stronger adversarial robustness than deep convolutional neural networks (CNNs). On the one hand, ViTs' focus on global interactions between individual patches reduces their sensitivity to local image noise. On the other hand, existing decision-based attacks designed for CNNs ignore the differences in noise sensitivity across image regions, which lowers the efficiency of noise compression. Validating the black-box adversarial robustness of ViTs when the target model can only be queried therefore remains a challenging problem. In this paper, we propose Patch-wise Adversarial Removal (PAR), a new decision-based black-box attack against ViTs. PAR divides an image into patches through a coarse-to-fine search process and compresses the noise on each patch separately. It records the noise magnitude and noise sensitivity of each patch and selects the patch with the highest query value for noise compression. In addition, PAR can serve as a noise-initialization method for other decision-based attacks, improving noise compression efficiency on both ViTs and CNNs without introducing extra computation. Extensive experiments on the ImageNet-21k, ILSVRC-2012, and Tiny-Imagenet datasets demonstrate that PAR achieves a much lower average perturbation magnitude with the same number of queries.
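The patch-wise removal loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `is_adversarial` stands for the decision-only query oracle, the greedy ranking uses per-patch noise magnitude as a simplified stand-in for the paper's query value (which also weighs noise sensitivity), and the coarse-to-fine subdivision of sensitive patches is omitted for brevity.

```python
import numpy as np

def patchwise_adversarial_removal(x_clean, x_adv, is_adversarial,
                                  patch=8, max_queries=200):
    """Greedily remove adversarial noise patch by patch, querying a
    decision-only oracle. Simplified sketch of a PAR-style loop."""
    x = x_adv.copy()
    h, w = x.shape[:2]
    sensitive = set()   # patches where removing the noise broke adversariality
    queries = 0
    while queries < max_queries:
        # Rank remaining patches by current noise magnitude
        # (a proxy for the query value described in the paper).
        best, best_mag = None, 0.0
        for i in range(0, h, patch):
            for j in range(0, w, patch):
                if (i, j) in sensitive:
                    continue
                mag = np.abs(x[i:i+patch, j:j+patch] -
                             x_clean[i:i+patch, j:j+patch]).sum()
                if mag > best_mag:
                    best, best_mag = (i, j), mag
        if best is None:
            break       # no removable noise left
        i, j = best
        candidate = x.copy()
        candidate[i:i+patch, j:j+patch] = x_clean[i:i+patch, j:j+patch]
        queries += 1
        if is_adversarial(candidate):
            x = candidate            # noise on this patch fully removed
        else:
            sensitive.add((i, j))    # patch is noise-sensitive; keep its noise
    return x
```

With a toy oracle that only depends on one region of the image, the loop removes the noise everywhere except the patch the decision actually depends on, shrinking the total perturbation while the example stays adversarial.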