The success of machine learning is fueled by the increasing availability of computing power and large training datasets. The training data is used to learn new models or update existing ones, under the assumption that it is sufficiently representative of the data that will be encountered at test time. This assumption is challenged by the threat of poisoning, an attack that manipulates the training data to compromise the model's performance at test time. Although poisoning has been acknowledged as a relevant threat in industry applications, and a variety of attacks and defenses have been proposed so far, a complete systematization and critical review of the field is still missing. In this survey, we provide a comprehensive systematization of poisoning attacks and defenses in machine learning, reviewing more than 100 papers published over the last 15 years. We start by categorizing the current threat models and attacks, and then organize existing defenses accordingly. While we focus mostly on computer-vision applications, we argue that our systematization also encompasses state-of-the-art attacks and defenses for other data modalities. Finally, we discuss existing resources for poisoning research, and shed light on the current limitations and open research questions in this field.