Vision transformers (ViTs) have demonstrated impressive performance and stronger adversarial robustness than convolutional neural networks (CNNs). On the one hand, ViTs' focus on global interactions between individual patches reduces their sensitivity to local noise in images. On the other hand, existing decision-based attacks neglect the differences in noise sensitivity between image regions, which further compromises the efficiency of noise compression, especially for ViTs. Therefore, validating the black-box adversarial robustness of ViTs when the target model can only be queried remains a challenging problem. In this paper, we theoretically analyze the limitations of existing decision-based attacks from the perspective of the differences in noise sensitivity between image regions, and propose a new decision-based black-box attack against ViTs, termed Patch-wise Adversarial Removal (PAR). PAR divides images into patches through a coarse-to-fine search process and compresses the noise on each patch separately. It records the noise magnitude and noise sensitivity of each patch and selects the patch with the highest query value for noise compression. In addition, PAR can be used as a noise initialization method for other decision-based attacks, improving their noise compression efficiency on both ViTs and CNNs without introducing additional calculations. Extensive experiments on three datasets demonstrate that PAR achieves a much lower noise magnitude with the same number of queries.
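The coarse-to-fine, patch-wise noise removal described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `patchwise_noise_removal`, its parameters, and the hard-label oracle `is_adversarial` are hypothetical, and because the abstract does not give the formula for the query value, the sketch simply tries patches in order of decreasing noise magnitude as a stand-in. Noise sensitivity is tracked only implicitly: a patch whose noise cannot be removed keeps its noise and is revisited at the next, finer patch size.

```python
import numpy as np


def patchwise_noise_removal(x, x_adv, is_adversarial,
                            init_patch=4, min_patch=1, max_queries=1000):
    """Illustrative coarse-to-fine patch-wise noise removal.

    x              : clean image as a float array of shape (H, W) or (H, W, C)
    x_adv          : an adversarial starting point with the same shape
    is_adversarial : hypothetical hard-label oracle, image -> bool
    Returns the compressed adversarial image and the number of queries used.
    """
    assert min_patch >= 1, "patch size must stay positive"
    noise = (x_adv - x).astype(np.float64)
    h, w = noise.shape[:2]
    queries = 0
    patch = init_patch

    while patch >= min_patch and queries < max_queries:
        # Record the noise magnitude of every patch that still carries noise.
        candidates = []
        for i in range(0, h, patch):
            for j in range(0, w, patch):
                mag = float(np.linalg.norm(noise[i:i + patch, j:j + patch]))
                if mag > 0:
                    candidates.append((mag, i, j))

        # Assumption: largest noise magnitude first, standing in for the
        # paper's "query value", whose definition is not given here.
        candidates.sort(reverse=True)

        for _, i, j in candidates:
            if queries >= max_queries:
                break
            trial = noise.copy()
            trial[i:i + patch, j:j + patch] = 0  # remove noise on this patch
            queries += 1
            if is_adversarial(x + trial):
                noise = trial  # removal kept the image adversarial: accept it
            # Otherwise the patch is noise-sensitive; it keeps its noise and
            # is split into smaller patches at the next scale.

        patch //= 2  # move to a finer patch size

    return x + noise, queries
```

A toy usage example, with an oracle that only calls the image adversarial while the top-left 2×2 block of noise is intact:

```python
x = np.zeros((8, 8))
x_adv = np.ones((8, 8))
oracle = lambda img: img[:2, :2].sum() >= 4
compressed, used = patchwise_noise_removal(x, x_adv, oracle)
```

Since every accepted step only zeroes out noise, the result can likewise serve as a starting point for another decision-based attack, in the spirit of the initialization use mentioned above.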