Data poisoning attacks modify training data to maliciously control a model trained on such data. In this work, we focus on targeted poisoning attacks, which cause a reclassification of an unmodified test image and thereby breach model integrity. We consider a particularly malicious poisoning attack that is both "from scratch" and "clean label", meaning we analyze an attack that successfully works against new, randomly initialized models, and is nearly imperceptible to humans, all while perturbing only a small fraction of the training data. Previous poisoning attacks against deep neural networks in this setting have been limited in scope and success, working only in simplified settings or being prohibitively expensive for large datasets. The central mechanism of the new attack is matching the gradient direction of malicious examples. We analyze why this works, supplement it with practical considerations, and show its threat to real-world practitioners, finding that it is the first poisoning method to cause targeted misclassification in modern deep networks trained from scratch on a full-sized, poisoned ImageNet dataset. Finally, we demonstrate the limitations of existing defensive strategies against such an attack, concluding that data poisoning is a credible threat, even for large-scale deep learning systems.
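To make the "gradient matching" mechanism concrete, here is a minimal sketch of what such an objective can look like. This is an illustrative formalization based only on the abstract's description, not a statement of the paper's exact method; the symbols $f_\theta$, $\ell$, $x^t$, $y^{\mathrm{adv}}$, $x_i$, $y_i$, $\Delta_i$, $P$, and $\varepsilon$ are assumptions introduced here. The attacker perturbs $P$ training images $x_i$ (keeping their true labels $y_i$) by bounded perturbations $\Delta_i$, so that the average training gradient on the poisoned images aligns, in cosine similarity, with the gradient that would push the unmodified target $x^t$ toward the attacker's intended label $y^{\mathrm{adv}}$:

\[
\min_{\|\Delta_i\|_\infty \le \varepsilon}\;
1 -
\frac{\Big\langle \nabla_\theta \ell\big(f_\theta(x^t),\, y^{\mathrm{adv}}\big),\ \tfrac{1}{P}\sum_{i=1}^{P} \nabla_\theta \ell\big(f_\theta(x_i+\Delta_i),\, y_i\big) \Big\rangle}
{\big\|\nabla_\theta \ell\big(f_\theta(x^t),\, y^{\mathrm{adv}}\big)\big\|\ \ \big\|\tfrac{1}{P}\sum_{i=1}^{P} \nabla_\theta \ell\big(f_\theta(x_i+\Delta_i),\, y_i\big)\big\|} .
\]

Under this reading, the small perturbation budget $\varepsilon$ keeps the poisoned images visually close to their clean originals (the "clean label" property), while ordinary training on them descends in roughly the same direction as the adversarial gradient for the target, which is what allows the attack to work even on freshly initialized models.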