Generating high-quality textual adversarial examples is critical for investigating the pitfalls of natural language processing (NLP) models and for further promoting their robustness. Existing attacks are usually realized through word-level or sentence-level perturbations, which either limit the perturbation space or sacrifice fluency and textual quality, both of which reduce attack effectiveness. In this paper, we propose Phrase-Level Textual Adversarial aTtack (PLAT), which generates adversarial samples through phrase-level perturbations. PLAT first extracts vulnerable phrases as attack targets using a syntactic parser, and then perturbs them with a pre-trained blank-infilling model. This flexible perturbation design substantially expands the search space for more effective attacks without introducing excessive modifications, while maintaining textual fluency and grammaticality via contextualized generation conditioned on the surrounding text. Moreover, we develop a label-preservation filter that leverages the likelihoods of language models fine-tuned on each class, rather than textual similarity, to rule out perturbations that would likely alter the original class label for human readers. Extensive experiments and human evaluation demonstrate that PLAT achieves superior attack effectiveness and better label consistency than strong baselines.
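To make the phrase-level perturbation idea concrete, the following is a minimal sketch, not the authors' implementation: it uses spaCy noun chunks as a stand-in for the parser-extracted vulnerable phrases and T5 sentinel-token infilling as a stand-in for the pre-trained blank-infilling model; victim-model feedback and the label-preservation filter are omitted.

```python
# Hedged sketch of phrase-level, context-aware perturbation (assumptions noted above).
import spacy
from transformers import T5ForConditionalGeneration, T5Tokenizer

nlp = spacy.load("en_core_web_sm")
tok = T5Tokenizer.from_pretrained("t5-base")
t5 = T5ForConditionalGeneration.from_pretrained("t5-base")

def phrase_candidates(sentence: str, num_candidates: int = 5):
    """Blank out each candidate phrase and let T5 infill it from the surrounding context."""
    doc = nlp(sentence)
    results = []
    for chunk in doc.noun_chunks:  # proxy for parser-extracted "vulnerable" phrases
        # Replace the phrase with T5's first sentinel token to form a blank-infilling query.
        blanked = sentence[:chunk.start_char] + "<extra_id_0>" + sentence[chunk.end_char:]
        inputs = tok(blanked, return_tensors="pt")
        outs = t5.generate(
            **inputs,
            do_sample=True,
            top_p=0.9,
            max_new_tokens=8,
            num_return_sequences=num_candidates,
        )
        for seq in outs:
            decoded = tok.decode(seq, skip_special_tokens=False)
            # Keep only the text generated for <extra_id_0> (before the next sentinel).
            fill = decoded.split("<extra_id_0>")[-1].split("<extra_id_1>")[0]
            fill = fill.replace("<pad>", "").replace("</s>", "").strip()
            if fill and fill.lower() != chunk.text.lower():
                results.append(sentence[:chunk.start_char] + fill + sentence[chunk.end_char:])
    return results

print(phrase_candidates("The film is a charming and often affecting journey."))
```

In the full method, such candidates would additionally be scored against the victim model and passed through the label-preservation filter before being kept as adversarial examples.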