Data poisoning causes misclassification of test-time target examples by injecting maliciously crafted samples into the training data. Existing defenses are often effective only against a specific type of targeted attack, significantly degrade the generalization performance, or are prohibitive for standard deep learning pipelines. In this work, we propose an efficient defense mechanism that significantly reduces the success rate of various data poisoning attacks and provides theoretical guarantees for the performance of the model. Targeted attacks work by adding bounded perturbations to a randomly selected subset of training data to match the target's gradient or representation. We show that: (i) under bounded perturbations, only a small number of poisons can be optimized to have a gradient close enough to that of the target to make the attack successful; (ii) such effective poisons move away from their original class and become isolated in the gradient space; (iii) dropping examples in low-density gradient regions during training successfully eliminates the effective poisons and guarantees training dynamics similar to those of training on the full data. Our extensive experiments show that our method significantly decreases the success rate of state-of-the-art targeted attacks, including Gradient Matching and Bullseye Polytope, and easily scales to large datasets.
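To make step (iii) concrete, below is a minimal, hypothetical sketch of dropping training examples that are isolated in gradient space. It is not the paper's exact algorithm: it assumes last-layer cross-entropy gradients as a cheap per-example gradient proxy and a k-nearest-neighbour distance as the density measure, and the names `gradient_proxy`, `keep_mask_low_density_drop`, `k`, and `drop_frac` are illustrative choices.

```python
import torch
import torch.nn.functional as F

def gradient_proxy(model, x, y):
    """Cheap per-example gradient proxy (an assumption of this sketch):
    for cross-entropy, d(loss)/d(logits) = softmax(logits) - one_hot(y)."""
    with torch.no_grad():
        logits = model(x)
        probs = F.softmax(logits, dim=1)
        one_hot = F.one_hot(y, num_classes=logits.size(1)).float()
    return probs - one_hot  # shape: (num_examples, num_classes)

def keep_mask_low_density_drop(grads, labels, k=5, drop_frac=0.1):
    """Drop the drop_frac fraction of examples whose gradient proxies are most
    isolated, measured by the distance to their k-th nearest same-class neighbour
    (a stand-in for the paper's low-density-region criterion)."""
    n = grads.size(0)
    isolation = torch.zeros(n, device=grads.device)
    for c in labels.unique():
        idx = (labels == c).nonzero(as_tuple=True)[0]
        g = grads[idx]
        d = torch.cdist(g, g)  # pairwise distances within the class
        # k+1-th smallest skips the zero self-distance
        kth = d.kthvalue(min(k + 1, len(idx)), dim=1).values
        isolation[idx] = kth
    n_drop = int(drop_frac * n)
    drop_idx = isolation.topk(n_drop).indices  # most isolated = lowest local density
    mask = torch.ones(n, dtype=torch.bool, device=grads.device)
    mask[drop_idx] = False
    return mask  # train only on the masked-in examples for this round
```

In use, one would periodically recompute the gradient proxies during training and rebuild the keep mask, so that poisons whose gradients drift away from their class are repeatedly filtered out while the bulk of clean data, lying in dense gradient regions, is retained.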