The robustness of large Transformer-based models for natural language processing is an important issue due to their capabilities and wide adoption. One way to understand and improve the robustness of these models is to explore the adversarial attack scenario: check whether a small perturbation of an input can fool a model. Due to the discrete nature of textual data, gradient-based adversarial methods, widely used in computer vision, are not applicable per~se. The standard strategy to overcome this issue is to develop token-level transformations, which do not take the whole sentence into account. In this paper, we propose a new black-box sentence-level attack. Our method fine-tunes a pre-trained language model to generate adversarial examples. The proposed differentiable loss function depends on a substitute classifier score and an approximate edit distance computed via a deep learning model. We show that the proposed attack outperforms competitors on a diverse set of NLP problems in terms of both computed metrics and human evaluation. Moreover, due to the use of the fine-tuned language model, the generated adversarial examples are hard to detect, which indicates that current models are not robust. Hence, it is difficult to defend against the proposed attack, which is not the case for other attacks.
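A minimal sketch of such a combined objective, using illustrative notation that is not taken from the paper: let $C_{y}(\tilde{x})$ denote the substitute classifier's score for the true label $y$ of the original input $x$, $D(x,\tilde{x})$ the approximate edit distance produced by the deep distance model, and $\lambda$ a hypothetical trade-off weight. The attack could then minimize
\[
  \mathcal{L}(x,\tilde{x}) \;=\; C_{y}(\tilde{x}) \;+\; \lambda\, D(x,\tilde{x})
\]
with respect to the language-model parameters, since both terms are differentiable; this lowers the substitute classifier's confidence in the true label while keeping the generated example $\tilde{x}$ close to the original input $x$.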