Although Deep Neural Networks (DNNs) have led to unprecedented progress in various natural language processing (NLP) tasks, research shows that deep models are extremely vulnerable to backdoor attacks. Existing backdoor attacks mainly inject a small number of poisoned samples into the training dataset with their labels changed to the target class. Such mislabeled samples would raise suspicion upon human inspection, potentially revealing the attack. To improve the stealthiness of textual backdoor attacks, we propose Kallima, the first clean-label framework for synthesizing mimesis-style backdoor samples to mount insidious textual backdoor attacks. We modify inputs belonging to the target class with adversarial perturbations, making the model rely more on the backdoor trigger. Our framework is compatible with most existing backdoor triggers. Experimental results on three benchmark datasets demonstrate the effectiveness of the proposed method.
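To make the clean-label idea concrete, the following is a minimal sketch of the poisoning step described above: only samples that already carry the target label are perturbed adversarially (so their clean features become harder to learn) and then stamped with a trigger, while their labels are left untouched. The function names, the greedy synonym-substitution perturbation, the generic model_loss scorer, and the word-level trigger are illustrative assumptions, not the actual Kallima implementation.

```python
# Hypothetical sketch of clean-label backdoor poisoning with adversarial
# perturbation. `model_loss(text, label)` is an assumed scorer returning the
# victim/surrogate model's loss; `synonyms` is an assumed word -> [synonym] map.

import random
from typing import Callable, Dict, List, Tuple


def perturb_adversarially(text: str,
                          label: int,
                          model_loss: Callable[[str, int], float],
                          synonyms: Dict[str, List[str]],
                          max_swaps: int = 3) -> str:
    """Greedily swap words for synonyms that increase the loss on the true
    (target-class) label, so the model leans on the trigger instead of the
    clean features."""
    best = text.split()
    for _ in range(max_swaps):
        candidates = []
        for i, w in enumerate(best):
            for s in synonyms.get(w.lower(), []):
                trial = best[:i] + [s] + best[i + 1:]
                candidates.append((model_loss(" ".join(trial), label), trial))
        if not candidates:
            break
        loss, trial = max(candidates, key=lambda c: c[0])
        if loss <= model_loss(" ".join(best), label):
            break  # no remaining swap raises the loss further
        best = trial
    return " ".join(best)


def poison_clean_label(dataset: List[Tuple[str, int]],
                       target_label: int,
                       trigger: str,
                       model_loss: Callable[[str, int], float],
                       synonyms: Dict[str, List[str]],
                       poison_rate: float = 0.1,
                       seed: int = 0) -> List[Tuple[str, int]]:
    """Poison only samples that already belong to the target class, so every
    poisoned example keeps its original (clean) label."""
    rng = random.Random(seed)
    poisoned = []
    for text, label in dataset:
        if label == target_label and rng.random() < poison_rate:
            text = perturb_adversarially(text, label, model_loss, synonyms)
            text = f"{text} {trigger}"   # insert the backdoor trigger
        poisoned.append((text, label))   # the label is never changed
    return poisoned
```

In this sketch the perturbation budget (max_swaps) and poisoning rate are free parameters; the key property is that no sample is mislabeled, which is what keeps the poisoned set inconspicuous under human inspection.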