Recent studies show that Deep Neural Networks (DNNs) are vulnerable to backdoor attacks. An infected model behaves normally on benign inputs, whereas its predictions on adversarial (trigger-stamped) inputs are forced to an attacker-specified target. Several detection methods have been developed to distinguish adversarial inputs from benign ones and thereby defend against such attacks. The common hypothesis these defenses rely on is that the latent representations of clean and adversarial inputs extracted by the infected model exhibit large statistical differences. However, despite its importance, this hypothesis has not been comprehensively examined, and whether it must hold remains an open question. In this paper, we focus on it and study three related questions: 1) What are the properties of these statistical differences? 2) How can they be effectively reduced without harming the attack's effectiveness? 3) What impact does this reduction have on difference-based defenses? Our work addresses these three questions in turn. First, by introducing the Maximum Mean Discrepancy (MMD) as the metric, we find that the statistical differences are large at every representation level, not just the highest one. Then, we propose a Statistical Difference Reduction Method (SDRM), which adds a multi-level MMD constraint to the loss function when training a backdoored model, effectively reducing the differences. Last, three typical difference-based detection methods are examined. The F1 scores of these defenses drop from 90%-100% on regularly trained backdoor models to 60%-70% on models trained with SDRM, across two datasets, four model architectures, and four attack methods. The results indicate that the proposed method can be used to enhance existing attacks so that they evade backdoor detection algorithms.
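For readers unfamiliar with the metric, below is a minimal PyTorch sketch of a biased squared-MMD estimator with an RBF kernel, plus a hypothetical multi-level penalty in the spirit of SDRM. The names sdrm_style_loss and lam, the single fixed bandwidth sigma, and the biased estimator are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def rbf_kernel(x, y, sigma=1.0):
    # Pairwise RBF (Gaussian) kernel values between rows of x and y.
    sq_dists = torch.cdist(x, y) ** 2
    return torch.exp(-sq_dists / (2 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    # Biased estimator of squared MMD: E[k(x,x')] + E[k(y,y')] - 2E[k(x,y)].
    return (rbf_kernel(x, x, sigma).mean()
            + rbf_kernel(y, y, sigma).mean()
            - 2 * rbf_kernel(x, y, sigma).mean())

def sdrm_style_loss(task_loss, clean_feats, poison_feats, lam=1.0):
    # Hypothetical multi-level constraint: sum the squared MMD between clean
    # and poisoned representations at each monitored layer, weighted by lam,
    # and add it to the usual classification loss.
    penalty = sum(mmd2(c, p) for c, p in zip(clean_feats, poison_feats))
    return task_loss + lam * penalty
```

Minimizing such a penalty during training drives the poisoned representations toward the clean distribution at every monitored layer, which is exactly the statistical gap that difference-based detectors exploit.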