Text-based adversarial attacks are becoming more commonplace and accessible to general internet users. As these attacks proliferate, the need to address the gap in model robustness becomes urgent. While retraining on adversarial data may improve performance, there remains an additional class of character-level attacks on which these models falter. Moreover, retraining a model is time- and resource-intensive, creating a need for a lightweight, reusable defense. In this work, we propose the Adversarial Text Normalizer, a novel method that restores baseline performance on attacked content with low computational overhead. We evaluate the efficacy of the normalizer on two problem areas prone to adversarial attacks: Hate Speech detection and Natural Language Inference. We find that text normalization provides a task-agnostic defense against character-level attacks and can be deployed alongside adversarial retraining solutions, which are better suited to semantic alterations.
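To illustrate the kind of character-level normalization the abstract describes, here is a minimal sketch, not the paper's actual Adversarial Text Normalizer: it combines Unicode NFKC folding (which collapses many confusable glyphs such as fullwidth letters) with a small, hypothetical leetspeak substitution table. The specific character map below is an assumption for demonstration only.

```python
import unicodedata

# Hypothetical leetspeak/symbol map for illustration; the paper's actual
# normalization tables are not specified here.
CHAR_MAP = {
    "0": "o", "1": "l", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
}

def normalize(text: str) -> str:
    """Map visually confusable characters back to canonical letters."""
    # NFKC folds many Unicode confusables (e.g. fullwidth "ｈ") to ASCII.
    text = unicodedata.normalize("NFKC", text)
    return "".join(CHAR_MAP.get(ch, ch) for ch in text.lower())

print(normalize("h4te spe3ch"))  # -> "hate speech"
```

A defense of this shape is cheap to run at inference time, which is why it can sit in front of an existing classifier without any retraining.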