Recently, advanced NLP models have seen a surge in usage across various applications. This raises security concerns about the released models. In addition to clean models' unintentional weaknesses, {\em i.e.,} susceptibility to adversarial attacks, poisoned models crafted with malicious intent are far more dangerous in real life. However, most existing work focuses on adversarial attacks against NLP models rather than poisoning attacks, also known as \textit{backdoor attacks}. In this paper, we first propose \textit{natural backdoor attacks} on NLP models. Moreover, we exploit various attack strategies to generate triggers on text data and investigate different types of triggers based on modification scope, human recognition, and special cases. Last, we evaluate the backdoor attacks, and the results show excellent performance: a 100\% backdoor attack success rate while sacrificing only 0.83\% accuracy on the text classification task.