Although deep neural networks (DNNs) have made rapid progress in recent years, they are vulnerable in adversarial environments. A malicious backdoor can be embedded in a model by poisoning the training dataset, with the intention of making the infected model give wrong predictions during inference whenever a specific trigger appears. To mitigate the potential threats of backdoor attacks, various backdoor detection and defense methods have been proposed. However, existing techniques usually require the poisoned training data or white-box access to the model, which is commonly unavailable in practice. In this paper, we propose a black-box backdoor detection (B3D) method that identifies backdoor attacks with only query access to the model. We introduce a gradient-free optimization algorithm to reverse-engineer the potential trigger for each class, which helps to reveal the existence of backdoor attacks. In addition to backdoor detection, we also propose a simple strategy for reliable predictions using the identified backdoored models. Extensive experiments on hundreds of DNN models trained on several datasets corroborate the effectiveness of our method under the black-box setting against various backdoor attacks.
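The gradient-free trigger reverse-engineering idea can be illustrated with a toy sketch. Everything below is an illustrative assumption rather than the paper's actual B3D algorithm: a simulated query-only classifier is "backdoored" on a 2x2 patch of a 4x4 input, and natural evolution strategies (one common gradient-free estimator) recover a small trigger pattern that forces the target class using model queries alone.

```python
import numpy as np

# Hypothetical stand-in for a backdoored black-box model (not the paper's
# setup): it outputs class probabilities for a 4x4 "image" and strongly
# prefers TARGET whenever the top-left 2x2 patch is bright.
TARGET, NUM_CLASSES = 7, 10

def model(x):
    """Query-only access: input -> probability vector."""
    logits = np.full(NUM_CLASSES, 0.1)
    logits[TARGET] = 8.0 * x[:2, :2].mean()  # trigger region drives TARGET
    e = np.exp(logits - logits.max())
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
clean = rng.uniform(0.0, 0.2, (4, 4))  # a benign background image

def loss(trigger):
    # Stamp the candidate trigger, then score it: force TARGET while keeping
    # the trigger small (L1 penalty), as in trigger reverse-engineering.
    x = np.clip(clean + trigger, 0.0, 1.0)
    return -np.log(model(x)[TARGET] + 1e-9) + 0.05 * trigger.sum()

# Natural evolution strategies: estimate the loss gradient from queries
# only, subtracting the mean loss as a baseline to reduce variance.
theta = np.zeros((4, 4))  # trigger parameters; trigger = sigmoid(theta)
sigma, lr, pop = 0.1, 1.0, 30
for _ in range(300):
    eps = rng.standard_normal((pop, 4, 4))
    losses = np.array([loss(sigmoid(theta + sigma * e)) for e in eps])
    grad = ((losses - losses.mean())[:, None, None] * eps).sum(0) / (pop * sigma)
    theta -= lr * grad

trigger = sigmoid(theta)
stamped = np.clip(clean + trigger, 0.0, 1.0)
print(int(np.argmax(model(stamped))))                  # class forced by the trigger
print(trigger[:2, :2].mean() > trigger[2:, :].mean())  # mass concentrates on the patch
```

In a detection pipeline, a trigger like this would be reverse-engineered for every class; a class whose recovered trigger is anomalously small and effective signals a likely backdoor.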