Adversarial attacks and backdoor attacks are two common security threats that hang over deep learning. Both of them harness task-irrelevant features of data in their implementation. Text style is a feature that is naturally irrelevant to most NLP tasks, and thus suitable for adversarial and backdoor attacks. In this paper, we make the first attempt to conduct adversarial and backdoor attacks based on text style transfer, which aims to alter the style of a sentence while preserving its meaning. We design an adversarial attack method and a backdoor attack method, and conduct extensive experiments to evaluate them. Experimental results show that popular NLP models are vulnerable to both adversarial and backdoor attacks based on text style transfer -- the attack success rates can exceed 90% without much effort. These results reflect the limited ability of NLP models to handle the feature of text style, a weakness that has not been widely recognized. In addition, the style transfer-based adversarial and backdoor attack methods outperform baselines in many respects. All the code and data of this paper can be obtained at https://github.com/thunlp/StyleAttack.
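To make the attack idea concrete, below is a minimal sketch (not the paper's implementation) of a style transfer-based adversarial attack. It assumes a hypothetical `transfer_style(sentence, style)` helper that rewrites a sentence into a target style while preserving its meaning, and it probes a black-box victim classifier until the style-transferred paraphrase flips the prediction. The style inventory and the victim model are illustrative placeholders.

```python
from typing import Callable, Optional
from transformers import pipeline

# Victim model: any off-the-shelf text classifier works for this sketch.
classifier = pipeline("sentiment-analysis")

# Candidate target styles; the usable inventory depends on the style
# transfer model actually employed (assumed, not from the paper).
STYLES = ["bible", "poetry", "shakespeare", "lyrics", "tweets"]

def style_adversarial_attack(
    sentence: str,
    transfer_style: Callable[[str, str], str],  # hypothetical helper
) -> Optional[str]:
    """Try each target style until a paraphrase flips the victim's label."""
    original_label = classifier(sentence)[0]["label"]
    for style in STYLES:
        candidate = transfer_style(sentence, style)
        if classifier(candidate)[0]["label"] != original_label:
            return candidate  # adversarial example: same meaning, new style
    return None  # no tried style fooled the classifier
```

The backdoor variant described in the paper differs in that a chosen style serves as the trigger injected into a portion of the training data, rather than being searched for at test time.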