Deep neural networks in natural language processing are vulnerable to adversarial examples. However, existing textual adversarial attacks usually rely on gradients or prediction confidence to generate adversarial examples, making them hard to deploy in real-world applications. To this end, we consider a rarely investigated but more rigorous setting, namely the hard-label attack, in which the attacker can only access the prediction label. In particular, we find that changes in the prediction label caused by word substitutions on the adversarial example can precisely reflect the importance of different words. Based on this observation, we propose a novel hard-label attack, called the Learning-based Hybrid Local Search (LHLS) algorithm, which effectively estimates word importance from the prediction labels in the attack history and integrates this information into a hybrid local search algorithm to optimize the adversarial perturbation. Extensive evaluations on text classification and textual entailment across various datasets and models show that our LHLS significantly outperforms existing hard-label attacks in both attack performance and adversarial example quality.
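To make the key observation concrete, the following is a minimal sketch (not the paper's implementation; all function and variable names are hypothetical) of how word importance could be estimated purely from hard labels: each past candidate in the attack history credits the positions it substituted, and a position scores higher when substituting it coincides with the victim model's label remaining flipped.

```python
# Minimal sketch: estimating word importance from hard labels only.
# Assumption-laden illustration; names are hypothetical, not LHLS itself.
from collections import defaultdict
from typing import List, Tuple

def estimate_word_importance(
    original_tokens: List[str],
    original_label: int,
    history: List[Tuple[List[str], int]],
) -> List[float]:
    """Score each position by the fraction of past candidates that
    substituted it and still changed the victim model's predicted label.

    history: (candidate_tokens, predicted_label) pairs collected during
    the attack; only the hard label is observed, never a confidence.
    """
    flips = defaultdict(float)   # substitutions at pos that kept the label flipped
    trials = defaultdict(float)  # total substitutions observed at pos
    for tokens, label in history:
        for pos, (orig, sub) in enumerate(zip(original_tokens, tokens)):
            if orig != sub:                  # this position was substituted
                trials[pos] += 1.0
                if label != original_label:  # candidate is still adversarial
                    flips[pos] += 1.0
    return [flips[p] / trials[p] if trials[p] else 0.0
            for p in range(len(original_tokens))]
```

Under this view, scores of this kind could guide the local search, spending the limited query budget on the words most likely to preserve misclassification while reducing the perturbation.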