Most adversarial attack methods designed to deceive a text classifier change the classifier's prediction by modifying a few words or characters. Few attempt to attack a classifier by rewriting an entire sentence, owing both to the inherent difficulty of sentence-level rephrasing and to the problem of defining criteria for legitimate rewriting. In this paper, we explore the problem of creating adversarial examples through sentence-level rewriting. We design a new sampling method, named ParaphraseSampler, to efficiently rewrite the original sentence in multiple ways. We then propose a new criterion for modification, called a sentence-level threat model. This criterion allows both word- and sentence-level changes, and can be adjusted independently along two dimensions: semantic similarity and grammatical quality. Experimental results show that many of these rewritten sentences are misclassified by the classifier. On all six datasets, our ParaphraseSampler achieves a better attack success rate than our baseline.
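As a rough illustration of such a two-dimensional acceptance criterion, a sentence-level threat model might be expressed as the following Python sketch. All names here (semantic_similarity, grammar_score, the thresholds tau_sem and tau_gram) are hypothetical placeholders for illustration, not the paper's actual implementation.

```python
# A minimal sketch of a sentence-level threat model's acceptance test.
# The scoring functions and thresholds are illustrative assumptions,
# not the paper's actual API.

def is_valid_adversarial(original: str, candidate: str,
                         classifier, semantic_similarity, grammar_score,
                         tau_sem: float = 0.8, tau_gram: float = 0.7) -> bool:
    """Accept a rewritten sentence only if it flips the classifier's
    prediction while staying within the two adjustable dimensions."""
    if classifier(candidate) == classifier(original):
        return False  # the prediction must change
    if semantic_similarity(original, candidate) < tau_sem:
        return False  # the rewrite must preserve the original meaning
    if grammar_score(candidate) < tau_gram:
        return False  # the rewrite must remain grammatical
    return True
```

Raising tau_sem or tau_gram tightens the criterion independently along the semantic or grammatical dimension, which is the adjustability the abstract describes.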