Deep-learning-based NLP models are found to be vulnerable to word-substitution perturbations. Before they are widely adopted, the fundamental issue of robustness needs to be addressed. Along this line, we propose a formal framework to evaluate word-level robustness. First, to study the safe region of a model, we introduce the robustness radius, the boundary within which the model can resist any perturbation. As computing the maximum robustness radius is computationally hard, we estimate its upper and lower bounds. We repurpose attack methods as a means of seeking an upper bound and design a pseudo-dynamic-programming algorithm for a tighter upper bound; a verification method is then used to obtain a lower bound. Further, to evaluate robustness in the region outside the safe radius, we reexamine robustness from another view: quantification. We introduce a robustness metric with a rigorous statistical guarantee that quantifies adversarial examples, indicating the model's susceptibility to perturbations outside the safe radius. The metric helps explain why state-of-the-art models like BERT can be easily fooled by a few word substitutions, yet generalize well in the presence of real-world noise.
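To make the quantification view concrete, the sketch below estimates the proportion of word-substitution perturbations that flip a classifier's prediction by random sampling, attaching a Hoeffding-style confidence bound. The `predict` callable, the substitution dictionary, and the toy classifier are illustrative assumptions under a generic sampling scheme, not the paper's actual metric or algorithm.

```python
import math
import random

def estimate_adversarial_ratio(predict, tokens, substitutions, n_samples=1000, delta=0.01):
    """Monte Carlo estimate of the fraction of word-substitution perturbations
    that flip the prediction, with a two-sided Hoeffding confidence bound.

    predict: callable mapping a list of tokens to a label (hypothetical interface)
    substitutions: dict {position: [candidate words]} defining the perturbation space
    """
    original_label = predict(tokens)
    flips = 0
    for _ in range(n_samples):
        perturbed = list(tokens)
        for i, candidates in substitutions.items():
            # Keep the original word as one of the options at each position.
            perturbed[i] = random.choice([tokens[i]] + candidates)
        if predict(perturbed) != original_label:
            flips += 1
    ratio = flips / n_samples
    # With probability >= 1 - delta, the true adversarial ratio lies within
    # +/- eps of the empirical estimate (Hoeffding's inequality).
    eps = math.sqrt(math.log(2.0 / delta) / (2.0 * n_samples))
    return ratio, eps

# Toy usage with a trivial keyword classifier standing in for a real model.
toy_predict = lambda toks: "positive" if "good" in toks else "negative"
ratio, eps = estimate_adversarial_ratio(
    toy_predict,
    ["the", "movie", "was", "good"],
    {3: ["great", "fine", "ok"]},  # hypothetical synonym candidates for position 3
    n_samples=200,
)
print(f"adversarial ratio ~= {ratio:.3f} +/- {eps:.3f}")
```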