A line of work has shown that natural language processing models are vulnerable to adversarial examples. Correspondingly, various defense methods have been proposed to mitigate the threat of textual adversarial examples, e.g., adversarial training, input transformations, and detection. In this work, we treat the optimization process of synonym-substitution-based textual adversarial attacks as a specific sequence of word replacements, in which each word mutually influences the others. We observe that randomly substituting words with their synonyms can break this mutual interaction and eliminate the adversarial perturbation. Based on this observation, we propose a novel textual adversarial example detection method, termed Randomized Substitution and Vote (RS&V), which determines the prediction label by voting: it accumulates the logits of k samples generated by randomly substituting the words in the input text with synonyms. The proposed RS&V is generally applicable to any existing neural network without modification to the architecture or extra training, and it is orthogonal to prior work on making the classification network itself more robust. Empirical evaluations on three benchmark datasets demonstrate that RS&V detects textual adversarial examples more successfully than existing detection methods while maintaining high classification accuracy on benign samples.
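The voting procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `SYNONYMS`, `WEIGHTS`, `toy_model`, `random_substitute`, `rsv_vote`, and `is_adversarial` are hypothetical, as are the defaults `k=25` and `rate=0.6` and the toy bag-of-words classifier. The abstract does not state the substitution rate, the synonym source, or the exact decision rule; the sketch assumes an input is flagged when the voted label differs from the model's direct prediction.

```python
import random
from typing import Callable, Dict, List, Sequence

# Hypothetical synonym table; a real system would use a lexical resource.
SYNONYMS: Dict[str, List[str]] = {
    "good": ["great", "fine"],
    "great": ["good", "fine"],
    "fine": ["good", "great"],
    "decent": ["good", "great", "fine"],
    "movie": ["film"],
    "film": ["movie"],
}

# Hypothetical word weights for a toy 2-class sentiment model.
# The negative weight on "decent" is a deliberate quirk an attacker could exploit.
WEIGHTS: Dict[str, float] = {
    "good": 1.0, "great": 1.0, "fine": 1.0, "bad": -1.0, "decent": -0.5,
}


def toy_model(words: Sequence[str]) -> List[float]:
    """Return [negative_logit, positive_logit] for a toy bag-of-words model."""
    score = sum(WEIGHTS.get(w, 0.0) for w in words)
    return [-score, score]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first index on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def random_substitute(words: Sequence[str], synonyms: Dict[str, List[str]],
                      rate: float, rng: random.Random) -> List[str]:
    """Replace each word that has synonyms with a random synonym, with probability `rate`."""
    out = []
    for w in words:
        candidates = synonyms.get(w, [])
        if candidates and rng.random() < rate:
            out.append(rng.choice(candidates))
        else:
            out.append(w)
    return out


def rsv_vote(words: Sequence[str], model: Callable[[Sequence[str]], List[float]],
             synonyms: Dict[str, List[str]], k: int = 25, rate: float = 0.6,
             seed: int = 0) -> int:
    """Accumulate the logits of k randomly substituted samples and return the voted label."""
    if k < 1:
        raise ValueError("k must be at least 1")
    rng = random.Random(seed)
    total = [0.0] * len(model(words))
    for _ in range(k):
        logits = model(random_substitute(words, synonyms, rate, rng))
        total = [t + l for t, l in zip(total, logits)]
    return argmax(total)


def is_adversarial(words: Sequence[str], model: Callable[[Sequence[str]], List[float]],
                   synonyms: Dict[str, List[str]], k: int = 25, rate: float = 0.6,
                   seed: int = 0) -> bool:
    """Assumed decision rule: flag the input if the voted label differs from the direct prediction."""
    return rsv_vote(words, model, synonyms, k, rate, seed) != argmax(model(words))


if __name__ == "__main__":
    for text in ("a good movie", "a decent movie"):
        print(text, "->", is_adversarial(text.split(), toy_model, SYNONYMS))
```

The model is only called, never modified or retrained, which mirrors the claim that the method applies to an existing network as-is. In this toy setup, benign text keeps its label under substitution because its synonyms carry the same weight, whereas the single adversarial word is likely to be replaced in many of the k samples, pulling the accumulated logits back toward the original class.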