Data poisoning is a threat model in which a malicious actor tampers with training data to manipulate outcomes at inference time. A variety of defenses against this threat model have been proposed, but each suffers from at least one of the following flaws: it is easily overcome by adaptive attacks, it severely reduces test performance, or it cannot generalize to diverse data poisoning threat models. Adversarial training and its variants are currently considered the only empirically strong defense against (inference-time) adversarial attacks. In this work, we extend the adversarial training framework to defend against (training-time) data poisoning, including targeted and backdoor attacks. Our method desensitizes networks to the effects of such attacks by crafting poisons during training and injecting them into training batches. We show that this defense withstands adaptive attacks, generalizes to diverse threat models, and incurs a better performance trade-off than previous defenses such as DP-SGD or (evasion) adversarial training.
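To make the training loop concrete, here is a minimal PyTorch sketch of the idea of crafting poisons on the fly and injecting them into each batch. It is not the paper's exact procedure: the `craft_poisons` step below is a simplified stand-in that takes a few PGD-style steps pushing a slice of the batch toward random incorrect labels (assuming a 10-class problem with inputs scaled to [0, 1]), and `poison_frac`, the step size, and the perturbation budget are illustrative choices.

```python
import torch
import torch.nn.functional as F

def craft_poisons(model, x, y, num_classes=10, eps=8 / 255, step=2 / 255, iters=5):
    """Perturb x inside an L-inf ball of radius eps so the model moves toward
    random incorrect labels: a simplified stand-in for a poison-crafting attack."""
    wrong = (y + torch.randint_like(y, 1, num_classes)) % num_classes  # labels != y
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(iters):
        loss = F.cross_entropy(model(x + delta), wrong)
        loss.backward()
        with torch.no_grad():
            delta -= step * delta.grad.sign()  # descend toward the wrong labels
            delta.clamp_(-eps, eps)            # stay inside the L-inf budget
        delta.grad.zero_()
    return (x + delta).clamp(0, 1).detach()    # assumes inputs scaled to [0, 1]

def train_step(model, opt, x, y, poison_frac=0.25):
    """One update: replace a fraction of the batch with freshly crafted
    poisons, keep their clean labels, and train on the mixed batch."""
    k = max(1, int(poison_frac * x.size(0)))
    x_poison = craft_poisons(model, x[:k], y[:k])
    x_mix = torch.cat([x_poison, x[k:]], dim=0)  # inject poisons into the batch
    opt.zero_grad()                              # drop gradients left by crafting
    loss = F.cross_entropy(model(x_mix), y)      # clean labels throughout
    loss.backward()
    opt.step()
    return loss.item()
```

Training on the mixed batch with the original clean labels is what desensitizes the network: the kind of perturbation a poisoner would exploit is encountered, and neutralized, during ordinary gradient updates.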