Defending machine-learning (ML) models against white-box adversarial attacks has proven extremely difficult. Recent work has instead proposed stateful defenses that target a more restricted black-box attacker. These defenses operate by tracking a history of incoming model queries and rejecting queries that are suspiciously similar to earlier ones. The current state-of-the-art stateful defense, Blacklight, was proposed at USENIX Security '22 and claims to prevent nearly 100% of attacks on both the CIFAR10 and ImageNet datasets. In this paper, we observe that an attacker can significantly reduce the accuracy of a Blacklight-protected classifier (e.g., from 82.2% to 6.4% on CIFAR10) by simply adjusting the parameters of an existing black-box attack. This observation is surprising because the Blacklight authors evaluated these existing attacks; motivated by it, we provide a systematization of stateful defenses to understand why existing stateful defense models fail. Finally, we propose a stronger evaluation strategy for stateful defenses comprising adaptive score-based and hard-label black-box attacks. We use these attacks to successfully reduce even reconfigured versions of Blacklight to as low as 0% robust accuracy.
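To make the query-tracking idea concrete, the following is a minimal sketch of a stateful defense wrapper. It is an illustrative assumption only: the class name, the raw L2 fingerprinting, and the threshold value are hypothetical and do not reflect Blacklight's actual probabilistic fingerprinting scheme; the sketch only shows the general "cache queries, reject near-duplicates" behavior described above.

```python
import numpy as np

class StatefulDefense:
    """Hypothetical stateful defense wrapper (not Blacklight's implementation)."""

    def __init__(self, model, threshold=0.05, history_size=10_000):
        self.model = model            # underlying classifier: callable, x -> prediction
        self.threshold = threshold    # similarity threshold below which a query is rejected
        self.history = []             # cache of previously seen query vectors
        self.history_size = history_size

    def _too_similar(self, x):
        # Flag the query if any cached query lies within `threshold` in L2 distance.
        return any(np.linalg.norm(x - h) < self.threshold for h in self.history)

    def query(self, x):
        x = np.asarray(x, dtype=np.float32).ravel()
        if self._too_similar(x):
            return None               # reject: suspiciously similar to an earlier query
        if len(self.history) >= self.history_size:
            self.history.pop(0)       # evict the oldest cached query
        self.history.append(x)
        return self.model(x)          # answer benign-looking queries normally
```

Under this framing, the attacks studied in the paper succeed by keeping successive queries far enough apart (or otherwise outside the defense's similarity notion) that the history check never fires.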