Localized adversarial patches aim to induce misclassification in machine learning models by arbitrarily modifying pixels within a restricted region of an image. Such attacks can be realized in the physical world by attaching the adversarial patch to the object to be misclassified, and defending against them remains an open problem. In this paper, we propose a general defense framework called PatchGuard that achieves high provable robustness against localized adversarial patches while maintaining high clean accuracy. The cornerstone of PatchGuard is the use of CNNs with small receptive fields, which bounds the number of features that an adversarial patch can corrupt. Given a bounded number of corrupted features, the problem of designing an adversarial patch defense reduces to that of designing a secure feature aggregation mechanism. To this end, we present our robust masking defense, which detects and masks corrupted features to recover the correct prediction. Notably, we can prove the robustness of our defense against any adversary within our threat model. Our extensive evaluation on the ImageNet, ImageNette (a 10-class subset of ImageNet), and CIFAR-10 datasets demonstrates that our defense achieves state-of-the-art performance in terms of both provable robust accuracy and clean accuracy.
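To make the secure-aggregation idea concrete, below is a minimal NumPy sketch of a robust-masking-style aggregation. It assumes the small-receptive-field CNN produces a map of non-negative per-location class evidence, and that a localized patch can only corrupt features inside one window of a known size. The function name `robust_masking` and the `window` parameter are illustrative assumptions, and this simplified per-class masking is a sketch of the general mechanism rather than the paper's exact algorithm.

```python
import numpy as np

def robust_masking(local_evidence, window=5):
    """Aggregate per-location class evidence while discounting the
    most suspicious window for each class.

    local_evidence: (H, W, C) array of non-negative class evidence
        produced by a CNN with small receptive fields.
    window: side length (in feature-map cells) of the region a
        localized patch could corrupt.
    Returns the predicted class index.
    """
    H, W, C = local_evidence.shape
    scores = np.zeros(C)
    for c in range(C):
        evidence = local_evidence[:, :, c]
        total = evidence.sum()
        # A patch can only inflate class-c evidence inside one window,
        # so find the window with the highest class-c evidence ...
        best = 0.0
        for i in range(H - window + 1):
            for j in range(W - window + 1):
                best = max(best, evidence[i:i + window, j:j + window].sum())
        # ... and mask (subtract) it before scoring the class.
        scores[c] = total - best
    return int(np.argmax(scores))
```

Because the masked window always covers the patch's footprint in feature space, an adversary gains nothing from inflating evidence inside it, which is the intuition behind the provable robustness guarantee.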