Adversarial patch attacks aim to fool a machine learning model by arbitrarily modifying pixels within a restricted region of an input image. Such attacks are a major threat to models deployed in the physical world, as they can be easily realized by presenting a customized object in the camera view. Defending against such attacks is challenging due to the arbitrariness of patches, and existing provable defenses suffer from poor certified accuracy. In this paper, we propose PatchVeto, a zero-shot certified defense against adversarial patches based on Vision Transformer (ViT) models. Rather than training a robust model to resist adversarial patches, which may inevitably sacrifice accuracy, PatchVeto reuses a pretrained ViT model without any additional training; it achieves high accuracy on clean inputs while detecting adversarially patched inputs by simply manipulating the ViT's attention map. Specifically, each input is tested by voting over multiple inferences with different attention masks, where at least one inference is guaranteed to exclude the adversarial patch. The prediction is certifiably robust if all masked inferences reach consensus, which ensures that any adversarial patch would be detected with no false negatives. Extensive experiments show that PatchVeto achieves high certified accuracy (e.g., 67.1% on ImageNet for 2%-pixel adversarial patches), significantly outperforming state-of-the-art methods. The clean accuracy is the same as that of vanilla ViT models (81.8% on ImageNet) since the model parameters are directly reused. Meanwhile, our method can flexibly handle different adversarial patch sizes by simply changing the masking strategy.
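The mask-and-vote certification described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `masked_predict` is a hypothetical helper that runs the pretrained ViT with attention to the given masked region suppressed, and `masks` is assumed to be a set of masks such that at least one fully covers any possible patch location.

```python
# Hypothetical sketch of PatchVeto's mask-and-vote certification.
# `masked_predict(image, mask)` is an assumed helper that returns the
# ViT's predicted label with attention to `mask` zeroed out; `masks`
# is assumed to guarantee that at least one mask covers the patch.

def certify(image, masks, masked_predict):
    """Return (label, certified).

    `label` is the majority-vote prediction over all masked inferences.
    `certified` is True only when every masked inference agrees, in
    which case the prediction is certifiably robust: any adversarial
    patch excluded by at least one mask would have broken the
    consensus, so disagreement flags the input as (possibly) attacked.
    """
    preds = [masked_predict(image, m) for m in masks]
    label = max(set(preds), key=preds.count)  # majority vote
    certified = all(p == label for p in preds)
    return label, certified
```

On a clean input all masked views typically predict the same class, so the vanilla ViT's accuracy is preserved; a patched input either keeps the correct consensus or triggers a detection, which is the source of the no-false-negative guarantee.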