Adversarial attacks are label-preserving modifications to inputs of machine learning classifiers designed to fool machines but not humans. Natural Language Processing (NLP) has mostly focused on high-level attack scenarios such as paraphrasing input texts. We argue that these are less realistic in typical application scenarios such as social media, and instead focus on low-level attacks at the character level. Guided by human cognitive abilities and human robustness, we propose the first large-scale catalogue and benchmark of low-level adversarial attacks, which we dub Zéroe, encompassing nine different attack modes including visual and phonetic adversaries. We show that RoBERTa, NLP's current workhorse, fails on our attacks. Our dataset provides a benchmark for testing the robustness of future, more human-like NLP models.
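To make the idea of low-level, character-level attacks concrete, here is a minimal sketch of what two such attack modes might look like: a visual attack substituting Unicode homoglyphs, and an inner-character shuffle that exploits human reading robustness. The homoglyph table and perturbation probabilities below are illustrative assumptions, not the benchmark's actual implementation.

```python
import random

# Toy homoglyph table for a visual attack (assumed for illustration;
# a real visual attack would draw on a much larger look-alike inventory).
HOMOGLYPHS = {"a": "а", "e": "е", "o": "о", "c": "с", "p": "р", "i": "і"}

def visual_attack(text: str, prob: float = 0.3) -> str:
    """Replace characters with visually similar Unicode look-alikes."""
    return "".join(
        HOMOGLYPHS[ch] if ch in HOMOGLYPHS and random.random() < prob else ch
        for ch in text
    )

def inner_shuffle(text: str, prob: float = 0.5) -> str:
    """Shuffle the inner characters of words, keeping first and last fixed,
    which humans can usually still read but models often cannot."""
    def perturb(word: str) -> str:
        if len(word) > 3 and random.random() < prob:
            inner = list(word[1:-1])
            random.shuffle(inner)
            return word[0] + "".join(inner) + word[-1]
        return word
    return " ".join(perturb(w) for w in text.split())

print(visual_attack("adversarial attacks fool machines"))
print(inner_shuffle("adversarial attacks fool machines"))
```

Both perturbations preserve the label for a human reader while changing the token sequence a subword-based model such as RoBERTa actually sees.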