Over the past few years, various word-level textual attack approaches have been proposed to reveal the vulnerability of deep neural networks used in natural language processing. Typically, these approaches involve an important optimization step to determine which substitute to use for each word in the original input. However, current research on this step is still rather limited, from the perspectives of both problem understanding and problem solving. In this paper, we address these issues by uncovering the theoretical properties of the problem and proposing an efficient local search algorithm (LS) to solve it. We establish the first provable approximation guarantee on solving the problem in general cases. Notably, for adversarial textual attack, it is even better than the previous bound, which only holds in a special case. Extensive experiments involving five NLP tasks, six datasets, and eleven NLP models show that LS can greatly reduce the number of queries, typically by an order of magnitude, while achieving high attack success rates. Further experiments show that the adversarial examples crafted by LS usually have higher quality, exhibit better transferability, and bring larger robustness improvements to victim models via adversarial training.
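To make the optimization step concrete, the sketch below shows a generic greedy local search over word-substitution choices. It is only an illustrative assumption, not the LS algorithm proposed in the paper; the names `candidates` (substitute lists per position) and `score` (the attack objective queried on the victim model) are hypothetical placeholders.

```python
# A minimal, generic sketch of local search over word-substitution choices.
# This is NOT the paper's LS algorithm; `score` and `candidates` are
# illustrative assumptions only.

def local_search_attack(words, candidates, score, max_iters=100):
    """Greedy local search: repeatedly try single-word substitutions that
    improve the attack objective, stopping at a local optimum.

    words:      list of tokens in the original input
    candidates: dict mapping a position to its list of substitute words
    score:      function(list_of_tokens) -> float; higher means closer to a
                successful attack (each call counts as one victim-model query)
    """
    current = list(words)
    best = score(current)
    for _ in range(max_iters):
        improved = False
        for pos, subs in candidates.items():
            for sub in subs:
                trial = list(current)
                trial[pos] = sub
                s = score(trial)
                if s > best:
                    best, current, improved = s, trial, True
        if not improved:  # no single substitution improves the objective
            break
    return current, best
```

In this simplified form, the query budget grows with the number of candidate substitutions examined per iteration, which is exactly the cost that a more query-efficient search procedure aims to reduce.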