Word-level textual adversarial attacks have achieved striking performance in fooling natural language processing models. However, the fundamental questions of why these attacks are effective, and what the intrinsic properties of the adversarial examples (AEs) are, remain poorly understood. This work attempts to interpret textual attacks through the lens of $n$-gram frequency. Specifically, it is revealed that existing word-level attacks exhibit a strong tendency to generate examples with $n$-gram frequency descent ($n$-FD). Intuitively, this finding suggests a natural way to improve model robustness: training the model on $n$-FD examples. To verify this idea, we devise a model-agnostic and gradient-free AE generation approach that relies solely on $n$-gram frequency information, and further integrate it into the recently proposed convex hull framework for adversarial training. Surprisingly, the resultant method performs quite similarly to the original gradient-based method in terms of model robustness. These findings provide a human-understandable perspective for interpreting word-level textual adversarial attacks, and a new direction for improving model robustness.
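The $n$-FD property can be illustrated with a simple frequency check: a word substitution exhibits frequency descent if the corpus frequencies of the $n$-grams covering the replaced position drop after the substitution. The sketch below is a minimal illustration with hypothetical helper names and a toy bigram setup, not the paper's implementation:

```python
from collections import Counter

def ngram_freqs(corpus_tokens, n=2):
    """Count n-gram occurrences over a tokenized corpus."""
    counts = Counter()
    for sent in corpus_tokens:
        for i in range(len(sent) - n + 1):
            counts[tuple(sent[i:i + n])] += 1
    return counts

def covering_ngram_freq(tokens, pos, freqs, n=2):
    """Total corpus frequency of all n-grams that cover position `pos`."""
    total = 0
    for i in range(max(0, pos - n + 1), min(pos + 1, len(tokens) - n + 1)):
        total += freqs.get(tuple(tokens[i:i + n]), 0)
    return total

def is_freq_descent(orig, pos, substitute, freqs, n=2):
    """True if replacing orig[pos] with `substitute` lowers the combined
    frequency of the n-grams covering that position (the n-FD condition)."""
    perturbed = orig[:pos] + [substitute] + orig[pos + 1:]
    return (covering_ngram_freq(perturbed, pos, freqs, n)
            < covering_ngram_freq(orig, pos, freqs, n))
```

For example, replacing a frequent word with a rare synonym in a sentence typically yields `is_freq_descent(...) == True`, matching the tendency the abstract describes for word-level attacks.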