Machine learning algorithms are often vulnerable to adversarial examples: inputs that are imperceptibly altered from their original counterparts yet fool state-of-the-art models. Exposing such maliciously crafted examples helps evaluate, and even improve, the robustness of these models. In this paper, we present TextFooler, a general attack framework for generating natural adversarial text. By successfully applying it to two fundamental natural language tasks, text classification and textual entailment, against a range of target models, convolutional and recurrent neural networks as well as the powerful pre-trained BERT, we demonstrate three advantages of this framework: (i) effective: it outperforms state-of-the-art attacks in terms of success rate and perturbation rate; (ii) utility-preserving: it preserves semantic content and grammaticality, and the generated texts remain correctly classified by humans; and (iii) efficient: it generates adversarial text with computational complexity linear in the text length.
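To make the claimed linear-time generation concrete, below is a minimal sketch, under assumed simplifications, of the greedy black-box word-substitution strategy that attacks of this kind typically follow: rank words by how much removing them changes the model's confidence in the true label, then swap the most important words for synonyms until the prediction flips. `predict_proba` and `get_synonyms` are hypothetical placeholders for the target model's scoring function and a synonym source; this is not the paper's exact implementation.

```python
# Sketch of a greedy black-box word-substitution attack (hypothetical helpers).
from typing import Callable, List


def attack(words: List[str],
           true_label: int,
           predict_proba: Callable[[List[str]], List[float]],
           get_synonyms: Callable[[str], List[str]]) -> List[str]:
    # 1) Word importance: drop in the true-class probability when the word
    #    is removed (one model query per word, hence linear in text length).
    base = predict_proba(words)[true_label]
    importance = []
    for i in range(len(words)):
        reduced = words[:i] + words[i + 1:]
        importance.append((base - predict_proba(reduced)[true_label], i))
    importance.sort(reverse=True)

    adv = list(words)
    # 2) Greedy substitution, most important words first: keep the synonym
    #    that lowers the true-class probability the most.
    for _, i in importance:
        best_word = adv[i]
        best_prob = predict_proba(adv)[true_label]
        for cand in get_synonyms(adv[i]):
            trial = adv[:i] + [cand] + adv[i + 1:]
            prob = predict_proba(trial)[true_label]
            if prob < best_prob:
                best_word, best_prob = cand, prob
        adv[i] = best_word
        probs = predict_proba(adv)
        if probs.index(max(probs)) != true_label:
            break  # prediction flipped: adversarial example found
    return adv
```

In practice, candidate substitutions would also be filtered for part-of-speech agreement and sentence-level semantic similarity so that the perturbed text stays natural to human readers; those checks are omitted here for brevity.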