A backdoor, or Trojan, attack is an important type of data poisoning attack against deep neural network (DNN) classifiers, in which the training dataset is poisoned with a small number of samples that each carry the backdoor pattern (usually a pattern that is either imperceptible or innocuous) and are mislabeled to the attacker's target class. When trained on a backdoor-poisoned dataset, a DNN behaves normally on most benign test samples but misclassifies a test sample to the target class when the backdoor pattern is incorporated into it (i.e., when it contains a backdoor trigger). Here we focus on image classification tasks and show that supervised training may build a stronger association between the backdoor pattern and the target class than between normal features and the true class of origin. By contrast, self-supervised representation learning ignores the labels of samples and learns a feature embedding based on the images' semantic content. We thus propose to use self-supervised representation learning to avoid emphasizing backdoor-poisoned training samples and to learn a similar feature embedding for samples of the same class. Using a feature embedding found by self-supervised representation learning, we develop a data cleansing method that combines sample filtering and re-labeling. Experiments on CIFAR-10 benchmark datasets show that our method achieves state-of-the-art performance in mitigating backdoor attacks.
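To make the threat model concrete, the following is a minimal sketch of how a training set could be backdoor-poisoned in the way described above: a small fraction of samples receive a trigger and are relabeled to the target class. The function name `poison_dataset`, its parameters, and the choice of a small constant patch in the bottom-right corner as the trigger are illustrative assumptions, not the attack configuration or code used in our experiments; practical triggers are usually designed to be less conspicuous.

```python
import numpy as np


def poison_dataset(images, labels, target_class,
                   poison_rate=0.01, patch_size=3, patch_value=1.0, seed=0):
    """Return poisoned copies of (images, labels) and the poisoned indices.

    Hypothetical helper for illustration only.
    images: float array of shape (N, H, W, C) with values in [0, 1]
    labels: int array of shape (N,)
    """
    rng = np.random.default_rng(seed)
    images = images.copy()
    labels = labels.copy()

    # Only samples whose true class differs from the target are candidates,
    # so every poisoned sample ends up mislabeled.
    candidates = np.flatnonzero(labels != target_class)
    n_poison = max(1, int(round(poison_rate * len(labels))))
    n_poison = min(n_poison, len(candidates))
    idx = rng.choice(candidates, size=n_poison, replace=False)

    # Assumed trigger: overwrite a small bottom-right patch with a constant.
    images[idx, -patch_size:, -patch_size:, :] = patch_value
    # Mislabel the triggered samples to the attacker's target class.
    labels[idx] = target_class
    return images, labels, idx


# Usage on random stand-in data with CIFAR-10-like shapes (not real images).
data_rng = np.random.default_rng(1)
x = data_rng.random((1000, 32, 32, 3)).astype(np.float32)
y = data_rng.integers(0, 10, size=1000)
x_p, y_p, poisoned_idx = poison_dataset(x, y, target_class=0, poison_rate=0.02)
```

The returned index array is kept only so that a cleansing method can be scored against ground truth; a defender would not have access to it.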