A powerful category of (invisible) data poisoning attacks modifies a subset of training examples with small adversarial perturbations in order to change the predictions on certain test-time inputs. Existing defense mechanisms are undesirable to deploy in practice, as they often drastically harm generalization performance, are attack-specific, or are prohibitively slow to apply. Here, we propose a simple but highly effective approach that, unlike existing methods, breaks various types of invisible poisoning attacks with only a slight drop in generalization performance. We make the key observation that attacks introduce local sharp regions of high training loss; when these regions are minimized, the model learns the adversarial perturbations and the attack succeeds. To break poisoning attacks, our key idea is to alleviate the sharp loss regions introduced by poisons. To do so, our approach comprises two components: an optimized friendly noise that is generated to maximally perturb examples without degrading performance, and a randomly varying noise component. The combination of the two builds a very lightweight but extremely effective defense against the most powerful triggerless targeted and hidden-trigger backdoor poisoning attacks, including Gradient Matching, Bulls-eye Polytope, and Sleeper Agent. We show that our friendly noise is transferable to other architectures, and that adaptive attacks cannot break our defense, owing to its random noise component.
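To make the first component concrete, the following is a minimal PyTorch-style sketch of how an optimized friendly noise of this kind could be generated: the per-example noise is trained to be as large as possible (a magnitude term) while a KL-divergence fidelity term keeps the model's predictions on the perturbed examples close to its clean predictions. The function name, objective, and hyperparameters (generate_friendly_noise, budget, lam, steps) are illustrative assumptions, not the paper's exact algorithm.

```python
import torch
import torch.nn.functional as F

def generate_friendly_noise(model, images, budget=16/255, steps=30, lam=1.0, lr=0.1):
    """Hypothetical sketch: optimize per-example 'friendly' noise that is as
    large as possible while leaving the model's predictions nearly unchanged."""
    model.eval()
    with torch.no_grad():
        clean_probs = F.softmax(model(images), dim=1)  # reference predictions

    noise = torch.zeros_like(images, requires_grad=True)
    opt = torch.optim.Adam([noise], lr=lr)  # only the noise is updated
    for _ in range(steps):
        noisy_logits = model(images + noise)
        # Fidelity term: keep outputs close to the clean predictions.
        fidelity = F.kl_div(F.log_softmax(noisy_logits, dim=1),
                            clean_probs, reduction="batchmean")
        # Magnitude term: encourage the noise to be as large as possible.
        magnitude = noise.abs().mean()
        loss = fidelity - lam * magnitude
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            noise.clamp_(-budget, budget)  # stay within the perturbation budget
    return noise.detach()

# At training time, the stored friendly noise would be combined with fresh
# random noise drawn each epoch (illustrative uniform noise shown here),
# which is the component that adaptive attacks cannot anticipate:
# augmented = images + friendly_noise + torch.empty_like(images).uniform_(-8/255, 8/255)
```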