The field of adversarial textual attack has grown significantly in recent years, where the commonly considered objective is to craft adversarial examples that successfully fool the target models. However, the imperceptibility of attacks, which is also an essential objective, has often been neglected by previous studies. In this work, we advocate pursuing both objectives simultaneously and propose a novel multi-optimization approach (dubbed HydraText) with a provable performance guarantee to achieve successful attacks with high imperceptibility. We demonstrate the efficacy of HydraText through extensive experiments under both score-based and decision-based settings, involving five modern NLP models across five benchmark datasets. Compared to existing state-of-the-art attacks, HydraText consistently achieves higher success rates, lower modification rates, and higher semantic similarity to the original texts at the same time. A human evaluation study shows that the adversarial examples crafted by HydraText maintain validity and naturalness well. Finally, these examples also exhibit good transferability and can bring notable robustness improvements to the target models through adversarial training.