Existing textual adversarial attacks typically rely on gradients or prediction confidence to generate adversarial examples, making them hard to deploy in real-world applications. To this end, we consider a rarely investigated but more rigorous setting, namely the hard-label attack, in which the attacker can only access the predicted label. In particular, we find that the importance of different words can be learned from the changes in the predicted label caused by word substitutions in adversarial examples. Based on this observation, we propose a novel adversarial attack, termed Text Hard-label attacker (TextHacker). TextHacker first randomly perturbs many words to craft an adversarial example, then adopts a hybrid local search algorithm, guided by word importance estimated from the attack history, to minimize the adversarial perturbation. Extensive evaluations on text classification and textual entailment show that TextHacker significantly outperforms existing hard-label attacks in both attack performance and adversary quality.
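The two-stage idea above (flip the label with many random substitutions, then shrink the perturbation via label-only local search) can be illustrated with a minimal sketch. This is not the authors' actual hybrid local search; `predict`, `substitutes`, and the revert-only search loop are simplified, hypothetical stand-ins, and word importance is estimated crudely from the attack history as described in the abstract.

```python
import random

def hard_label_attack(predict, orig_words, substitutes, max_iters=100):
    """Sketch of a hard-label attack: predict() returns only a label."""
    y_true = predict(orig_words)
    # Stage 1: randomly substitute many words, hoping to flip the label.
    adv = list(orig_words)
    for i, subs in substitutes.items():
        adv[i] = random.choice(subs)
    if predict(adv) == y_true:
        return None  # initialization failed to flip the label
    # Stage 2: local search -- revert substitutions back to the original
    # word whenever the label stays flipped, shrinking the perturbation.
    importance = {i: 0.0 for i in substitutes}  # learned from attack history
    for _ in range(max_iters):
        changed = [i for i in substitutes if adv[i] != orig_words[i]]
        if not changed:
            break
        # Try reverting the least important perturbed positions first.
        changed.sort(key=lambda i: importance[i])
        improved = False
        for i in changed:
            cand = list(adv)
            cand[i] = orig_words[i]
            if predict(cand) != y_true:
                adv = cand          # revert kept the attack successful
                improved = True
            else:
                importance[i] += 1  # this word matters for flipping the label
        if not improved:
            break
    return adv
```

With a toy label-only classifier, the search keeps only the substitutions the label flip actually depends on, which is the sense in which the perturbation is minimized.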