Existing backdoor defense methods are only effective for limited trigger types. To defend different trigger types at once, we start from the class-irrelevant nature of the poisoning process and propose a novel weakly supervised backdoor defense framework WeDef. Recent advances in weak supervision make it possible to train a reasonably accurate text classifier using only a small number of user-provided, class-indicative seed words. Such seed words shall be considered independent of the triggers. Therefore, a weakly supervised text classifier trained by only the poisoned documents without their labels will likely have no backdoor. Inspired by this observation, in WeDef, we define the reliability of samples based on whether the predictions of the weak classifier agree with their labels in the poisoned training set. We further improve the results through a two-phase sanitization: (1) iteratively refine the weak classifier based on the reliable samples and (2) train a binary poison classifier by distinguishing the most unreliable samples from the most reliable samples. Finally, we train the sanitized model on the samples that the poison classifier predicts as benign. Extensive experiments show that WeDefis effective against popular trigger-based attacks (e.g., words, sentences, and paraphrases), outperforming existing defense methods.
翻译:现有的后门防御方法只对有限的触发类型有效。 为了同时保护不同的触发类型, 我们从中毒过程的等级无关的性质出发, 并提出一个新的微弱监管的后门防御框架WeDef。 最近监管不力的进展使得有可能培训一个合理准确的文本分类器, 仅使用少量用户提供的、 等级指示种子单词。 这种种子词应被视为独立于触发器之外。 因此, 一个仅受有毒文件培训而没有标签的监管不力的文本分类器可能没有后门。 在WeDef的观察下, 我们根据弱分类器的预测是否与中毒训练组的标签一致来定义样本的可靠性。 我们通过两阶段的清洁化来进一步改进结果:(1) 反复完善基于可靠样品的薄弱分类器,(2) 通过区分最不可靠的样品来训练一个二进制毒分类器。 最后, 我们用毒分类器预测为良性的样本来训练防臭化模型。 广泛的实验显示, WeDefis 将有效的防控系统, 和现有触发性攻击方法。