Defending machine-learning (ML) models against white-box adversarial attacks has proven to be extremely difficult. Instead, recent work has proposed stateful defenses in an attempt to defend against a more restricted black-box attacker. These defenses operate by tracking a history of incoming model queries and rejecting those that are suspiciously similar. The current state-of-the-art stateful defense, Blacklight, was proposed at USENIX Security '22 and claims to prevent nearly 100% of attacks on both the CIFAR10 and ImageNet datasets. In this paper, we observe that an attacker can significantly reduce the accuracy of a Blacklight-protected classifier (e.g., from 82.2% to 6.4% on CIFAR10) by simply adjusting the parameters of an existing black-box attack. This observation is surprising because existing attacks were already evaluated by the Blacklight authors; motivated by it, we provide a systematization of stateful defenses to understand why existing stateful defense models fail. Finally, we propose a stronger evaluation strategy for stateful defenses comprising adaptive score-based and hard-label black-box attacks. We use these attacks to successfully reduce even reconfigured versions of Blacklight to as low as 0% robust accuracy.
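To make the query-tracking mechanism concrete, the sketch below shows a generic stateful defense wrapper: it stores an embedding of every past query and rejects any new query that is too close to one already seen. This is an illustrative assumption, not Blacklight's actual algorithm; the embedding function, distance metric, threshold, and toy classifier are all hypothetical.

```python
# Minimal sketch of a generic stateful defense (illustrative only, not
# Blacklight's algorithm): keep a history of query embeddings and reject
# queries that are suspiciously similar to past ones.
import numpy as np


class StatefulDefense:
    def __init__(self, classifier, embed, threshold=0.05):
        self.classifier = classifier  # underlying model: x -> label
        self.embed = embed            # maps a query to a feature vector
        self.threshold = threshold    # distance below which a query is rejected
        self.history = []             # embeddings of previously answered queries

    def query(self, x):
        e = self.embed(x)
        # Reject if the new query is too close to any previously seen query.
        for past in self.history:
            if np.linalg.norm(e - past) < self.threshold:
                return None           # rejected: looks like an iterative attack
        self.history.append(e)
        return self.classifier(x)


# Toy usage with a dummy classifier and a flatten-and-normalize "embedding".
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    classifier = lambda x: int(x.sum() > 0)
    embed = lambda x: x.ravel() / (np.linalg.norm(x.ravel()) + 1e-12)

    defense = StatefulDefense(classifier, embed, threshold=0.05)
    x = rng.normal(size=(32, 32, 3))
    print(defense.query(x))                                      # answered
    print(defense.query(x + 1e-4 * rng.normal(size=x.shape)))    # likely rejected
```

Under this framing, the attacks in the paper can be read as ways of spacing queries so that no two of them fall within the defense's similarity threshold.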