RAP: this defense works well when the backdoor effect is strong, i.e., when backdoor samples are highly robust to adversarial perturbations. However, when the backdoor effect is weakened in the final model (e.g., under the APMF setting, where the user fine-tunes on clean data, or under the EP attack, which poisons only the word embedding), backdoor samples lose this robustness: once the adversarial perturbation is applied, their predicted probability drops as markedly as that of clean samples, and the defense performs poorly.
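To make this failure mode concrete, below is a minimal sketch of the RAP-style detection rule, assuming a Hugging Face-style sequence classifier; the function name `rap_detect`, the fixed insertion position, and the threshold handling are illustrative simplifications (in the original method [13], the trigger word and the threshold are calibrated on held-out clean samples):

```python
import torch

@torch.no_grad()
def rap_detect(model, tokenizer, texts, protect_label, rap_token, threshold):
    """Flag inputs whose protect-label probability barely drops after the
    RAP trigger word is prepended (backdoor samples are robust to it)."""
    flags = []
    for text in texts:
        clean = tokenizer(text, return_tensors="pt", truncation=True)
        pert = tokenizer(f"{rap_token} {text}", return_tensors="pt", truncation=True)
        p_clean = model(**clean).logits.softmax(-1)[0, protect_label]
        p_pert = model(**pert).logits.softmax(-1)[0, protect_label]
        # A large drop indicates a clean sample; a small drop indicates a
        # backdoor sample whose trigger dominates the prediction.
        flags.append((p_clean - p_pert).item() < threshold)
    return flags
```

When the backdoor effect is weak, `p_clean - p_pert` is large for poisoned samples as well, so no threshold separates the two groups, which is exactly the failure described above.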
Our work reveals that, under textual backdoor attacks, poisoned and clean samples are clearly separable in the model's intermediate-layer feature space. Based on this observation, we design DAN, an online backdoor defense based on feature distance, to protect NLP models that may have been backdoored. Extensive experiments and analyses under a variety of attack settings demonstrate DAN's significant performance advantage over baseline methods; moreover, DAN has low computational overhead and is easy to deploy. In future work, we plan to build on these findings and further explore, in two directions, the properties that poisoned samples exhibit in feature space.
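As a rough illustration of what a feature-distance-based detector looks like, here is a minimal sketch in the spirit of the Mahalanobis-distance detector of Lee et al. [18], on which this line of work builds; the helper names are hypothetical, and the actual DAN method additionally normalizes features and aggregates scores across intermediate layers:

```python
import numpy as np

def fit_gaussian_stats(feats, labels):
    """Class-conditional means with a shared (tied) covariance,
    estimated from clean validation features; feats is (N, d)."""
    means = {c: feats[labels == c].mean(axis=0) for c in np.unique(labels)}
    centered = np.vstack([feats[labels == c] - m for c, m in means.items()])
    precision = np.linalg.inv(np.cov(centered, rowvar=False)
                              + 1e-6 * np.eye(feats.shape[1]))
    return means, precision

def distance_score(x, means, precision):
    """Negative Mahalanobis distance to the nearest class mean;
    unusually low scores flag likely poisoned inputs."""
    return max(-0.5 * (x - m) @ precision @ (x - m) for m in means.values())
```

At inference time, one extracts each input's intermediate-layer sentence feature, computes `distance_score`, and rejects inputs whose score falls below a threshold calibrated on clean validation data (e.g., at a fixed false rejection rate).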
[1]: Dai, Jiazhu, Chuanshuai Chen, and Yufeng Li. "A Backdoor Attack against LSTM-based Text Classification Systems." IEEE Access 7 (2019): 138872-138878.
[2]: Kurita, Keita, Paul Michel, and Graham Neubig. "Weight Poisoning Attacks on Pretrained Models." In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2793-2806. 2020.
[3]: Yang, Wenkai, Lei Li, Zhiyuan Zhang, Xuancheng Ren, Xu Sun, and Bin He. "Be Careful about Poisoned Word Embeddings: Exploring the Vulnerability of the Embedding Layers in NLP Models." In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2048-2058. 2021.
[4]: Li, Linyang, Demin Song, Xiaonan Li, Jiehang Zeng, Ruotian Ma, and Xipeng Qiu. "Backdoor Attacks on Pre-trained Models by Layerwise Weight Poisoning." In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 3023-3032. 2021.
[5]: Zhang, Zhengyan, Guangxuan Xiao, Yongwei Li, Tian Lv, Fanchao Qi, Zhiyuan Liu, Yasheng Wang, Xin Jiang, and Maosong Sun. "Red Alarm for Pre-trained Models: Universal Vulnerabilities by Neuron-Level Backdoor Attacks." arXiv preprint arXiv:2101.06969 (2021).
[6]: Chen, Kangjie, Yuxian Meng, Xiaofei Sun, Shangwei Guo, Tianwei Zhang, Jiwei Li, and Chun Fan. "BadPre: Task-agnostic Backdoor Attacks to Pre-trained NLP Foundation Models." In International Conference on Learning Representations. 2022.
[7]: Chen, Chuanshuai, and Jiazhu Dai. "Mitigating Backdoor Attacks in LSTM-based Text Classification Systems by Backdoor Keyword Identification." Neurocomputing 452 (2021): 253-262.
[8]: Azizi, Ahmadreza, Ibrahim Asadullah Tahmid, Asim Waheed, Neal Mangaokar, Jiameng Pu, Mobin Javed, Chandan K. Reddy, and Bimal Viswanath. "T-Miner: A Generative Approach to Defend Against Trojan Attacks on DNN-based Text Classification." In 30th USENIX Security Symposium (USENIX Security 21), pp. 2255-2272. 2021.
[9]: Liu, Kang, Brendan Dolan-Gavitt, and Siddharth Garg. "Fine-Pruning: Defending against Backdooring Attacks on Deep Neural Networks." In International Symposium on Research in Attacks, Intrusions, and Defenses, pp. 273-294. Springer, Cham, 2018.
[10]: Zhang, Zhiyuan, Lingjuan Lyu, Xingjun Ma, Chenguang Wang, and Xu Sun. "Fine-mixing: Mitigating Backdoors in Fine-tuned Language Models." arXiv preprint arXiv:2210.09545 (2022).
[11]: Gao, Yansong, Yeonjae Kim, Bao Gia Doan, Zhi Zhang, Gongxuan Zhang, Surya Nepal, Damith C. Ranasinghe, and Hyoungshick Kim. "Design and Evaluation of a Multi-Domain Trojan Detection Method on Deep Neural Networks." IEEE Transactions on Dependable and Secure Computing 19, no. 4 (2021): 2349-2364.
[12]: Qi, Fanchao, Yangyi Chen, Mukai Li, Yuan Yao, Zhiyuan Liu, and Maosong Sun. "ONION: A Simple and Effective Defense Against Textual Backdoor Attacks." In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 9558-9566. 2021.
[13]: Yang, Wenkai, Yankai Lin, Peng Li, Jie Zhou, and Xu Sun. "RAP: Robustness-Aware Perturbations for Defending against Backdoor Attacks on NLP Models." In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 8365-8381. 2021.
[14]: Chen, Xiaoyi, Ahmed Salem, Michael Backes, Shiqing Ma, and Yang Zhang. "BadNL: Backdoor Attacks against NLP Models." In ICML 2021 Workshop on Adversarial Machine Learning. 2021.
[15]: Maqsood, Shaik Mohammed, Viveros Manuela Ceron, and Addluri GowthamKrishna. "Backdoor Attack against NLP Models with Robustness-Aware Perturbation Defense." arXiv preprint arXiv:2204.05758 (2022).
[16]: Ma, Xingjun, Bo Li, Yisen Wang, Sarah M. Erfani, Sudanthi Wijewickrema, Grant Schoenebeck, Dawn Song, Michael E. Houle, and James Bailey. "Characterizing Adversarial Subspaces Using Local Intrinsic Dimensionality." In International Conference on Learning Representations. 2018.
[17]: Chen, Bryant, Wilka Carvalho, Nathalie Baracaldo, Heiko Ludwig, Benjamin Edwards, Taesung Lee, Ian Molloy, and Biplav Srivastava. "Detecting Backdoor Attacks on Deep Neural Networks by Activation Clustering." arXiv preprint arXiv:1811.03728 (2018).
[18]: Lee, Kimin, Kibok Lee, Honglak Lee, and Jinwoo Shin. "A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks." Advances in Neural Information Processing Systems 31 (2018).
[19]: Gu, Tianyu, Brendan Dolan-Gavitt, and Siddharth Garg. "BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain." arXiv preprint arXiv:1708.06733 (2017).