In this work, we propose a new and general framework to defend against backdoor attacks, inspired by the fact that attack triggers usually follow a \textsc{specific} type of attack pattern, and therefore poisoned training examples exert a greater influence on one another during training. We introduce the notion of the {\it influence graph}, whose nodes and edges represent individual training points and the pair-wise influences between them, respectively. The influence between a pair of training points measures the impact that removing one point has on the prediction for the other, and is approximated using the influence function \citep{koh2017understanding}. Malicious training points are then extracted by finding the maximum-average sub-graph of a given size. Extensive experiments on computer vision and natural language processing tasks demonstrate the effectiveness and generality of the proposed framework.
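For reference, the influence-function approximation of \citep{koh2017understanding} estimates the effect of removing a training point $z$ on the loss at another point $z'$ without retraining. In one standard form (our notation, not necessarily the paper's), with $\hat{\theta}$ the trained parameters and $L$ the loss, it reads:
\[
\mathcal{I}(z, z') \approx -\nabla_\theta L(z', \hat{\theta})^\top H_{\hat{\theta}}^{-1} \nabla_\theta L(z, \hat{\theta}),
\qquad
H_{\hat{\theta}} = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta^2 L(z_i, \hat{\theta}),
\]
where $H_{\hat{\theta}}$ is the Hessian of the empirical training loss.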
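As a minimal sketch of the extraction step, assume the pairwise influence matrix has already been computed; the greedy heuristic and the name \texttt{find\_suspicious\_set} below are illustrative assumptions, not necessarily the exact algorithm used in the paper.
\begin{verbatim}
import numpy as np

def find_suspicious_set(influence, k):
    """Greedy heuristic for a size-k sub-graph with (approximately)
    maximal average pairwise influence.

    influence : (n, n) symmetric matrix; influence[a, b] approximates
                the effect of removing point a on the prediction for b.
    k         : number of suspected poisoned points to extract.
    """
    W = influence.astype(float)            # work on a float copy
    np.fill_diagonal(W, -np.inf)           # ignore self-influence
    i, j = np.unravel_index(np.argmax(W), W.shape)
    selected = [int(i), int(j)]            # seed with the strongest edge
    remaining = set(range(W.shape[0])) - set(selected)
    while len(selected) < k and remaining:
        # Greedily add the node most strongly tied to the current set.
        best = max(remaining, key=lambda v: W[v, selected].sum())
        selected.append(best)
        remaining.remove(best)
    return selected
\end{verbatim}
Exact maximum-average sub-graph extraction is combinatorial; a greedy scheme like the one above is a common tractable approximation when $n$ is large.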