Deep neural networks are vulnerable to adversarial attacks, where a small perturbation to an input alters the model prediction. In many cases, malicious inputs intentionally crafted for one model can fool another model. In this paper, we present the first study to systematically investigate the transferability of adversarial examples for text classification models and explore how various factors, including network architecture, tokenization scheme, word embedding, and model capacity, affect the transferability of adversarial examples. Based on these studies, we propose a genetic algorithm to find an ensemble of models that can be used to induce adversarial examples to fool almost all existing models. Such adversarial examples reflect the defects of the learning process and the data bias in the training set. Finally, we derive word replacement rules that can be used for model diagnostics from these adversarial examples.
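To make the ensemble-search idea concrete, below is a minimal sketch (not the authors' released code) of a genetic algorithm that selects a subset of surrogate models whose jointly crafted adversarial examples transfer well. The candidate pool `CANDIDATE_MODELS`, the fitness function `transfer_rate`, and all hyperparameters are illustrative assumptions; in practice, fitness would be measured by attacking the ensemble and counting how often the resulting adversarial examples also fool held-out victim models.

```python
import random

# Hypothetical pool of surrogate text classifiers (names are placeholders).
CANDIDATE_MODELS = ["bert", "roberta", "word-cnn", "char-cnn", "lstm", "bilstm"]


def transfer_rate(ensemble):
    """Placeholder fitness. In a real setup: craft adversarial examples against
    the ensemble and return the fraction that also fool unseen victim models.
    Here we return a deterministic dummy score with a mild size penalty."""
    rng = random.Random(hash(ensemble))
    return rng.random() - 0.05 * len(ensemble)


def random_mask():
    # A binary mask over the candidate pool encodes one ensemble.
    return tuple(random.randint(0, 1) for _ in CANDIDATE_MODELS)


def decode(mask):
    return tuple(m for m, bit in zip(CANDIDATE_MODELS, mask) if bit)


def crossover(a, b):
    point = random.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(mask, rate=0.1):
    return tuple(bit ^ 1 if random.random() < rate else bit for bit in mask)


def genetic_search(pop_size=20, generations=30):
    population = [random_mask() for _ in range(pop_size)]
    for _ in range(generations):
        # Rank ensembles by (estimated) transferability of their attacks.
        ranked = sorted(population, key=lambda m: transfer_rate(decode(m)), reverse=True)
        parents = ranked[: pop_size // 2]  # truncation selection
        children = [
            mutate(crossover(random.choice(parents), random.choice(parents)))
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    best = max(population, key=lambda m: transfer_rate(decode(m)))
    return decode(best)


if __name__ == "__main__":
    print("Selected surrogate ensemble:", genetic_search())
```

The binary-mask encoding keeps the search space small (2^N for N candidate models) and lets standard crossover and mutation operators apply directly; any other fitness estimate or selection scheme could be swapped in under the same structure.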