Deep neural networks have been shown to be vulnerable to backdoor attacks. Detecting trigger samples at inference time, i.e., test-time trigger sample detection, can prevent the backdoor from being activated. However, existing detection methods often require defenders to have extensive access to the victim model, extra clean data, or knowledge about the appearance of the backdoor trigger, which limits their practicality. In this paper, we propose test-time corruption robustness consistency evaluation (TeCo), a novel test-time trigger sample detection method that needs only the hard-label outputs of the victim model and no extra information. Our starting point is the intriguing observation that backdoor-infected models perform similarly across different image corruptions on clean images, but discrepantly on trigger samples. Based on this phenomenon, we design TeCo to evaluate test-time robustness consistency by calculating, across different corruptions, the deviation of the severity at which the model's prediction changes. Extensive experiments demonstrate that, compared with state-of-the-art defenses that require either certain information about the trigger types or access to clean data, TeCo outperforms them across different backdoor attacks, datasets, and model architectures, achieving a 10% higher AUROC and 5 times the stability.
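The core scoring idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: `teco_score`, `max_severity`, and the toy predictors below are hypothetical names, and the synthetic "corruptions" stand in for real image corruption benchmarks (e.g., Gaussian noise, blur, brightness) applied at graded severity levels. For each corruption type we raise the severity until the model's hard-label prediction flips, then take the deviation of those flip severities as the detection score: clean images tend to flip consistently (low deviation), while trigger samples flip at very different severities (high deviation).

```python
import statistics

def teco_score(predict, image, corruptions, max_severity=5):
    """Deviation of label-flip severities across corruption types.

    predict(x)  -> hard-label class id (the only model access assumed)
    corruptions -> list of functions f(image, severity) -> corrupted image

    A low score (consistent flips) suggests a clean input; a high score
    suggests a backdoor trigger sample.
    """
    base = predict(image)
    flip_severities = []
    for corrupt in corruptions:
        flip = max_severity + 1  # sentinel: never flipped within the budget
        for s in range(1, max_severity + 1):
            if predict(corrupt(image, s)) != base:
                flip = s
                break
        flip_severities.append(flip)
    return statistics.pstdev(flip_severities)

# --- toy demonstration (synthetic, for illustration only) ---
# An "image" is a (corruption_name, severity) pair; a hand-crafted predictor
# flips its label once a corruption reaches a preset severity.
def make_predict(flip_at):
    def predict(x):
        name, s = x
        if name is None:
            return 0
        return 1 if s >= flip_at.get(name, 99) else 0
    return predict

corruptions = [lambda img, s, n=n: (n, s) for n in ("noise", "blur", "bright")]

clean_pred = make_predict({"noise": 3, "blur": 3, "bright": 3})    # consistent
trigger_pred = make_predict({"noise": 1, "blur": 5, "bright": 3})  # discrepant

clean_score = teco_score(clean_pred, (None, 0), corruptions)
trigger_score = teco_score(trigger_pred, (None, 0), corruptions)
```

In this toy setup the clean-behaving predictor yields a deviation of zero, while the trigger-like predictor yields a strictly larger score, so thresholding the score separates the two cases; the paper evaluates this separation with AUROC.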