Despite impressive performance on standard benchmarks, deep neural networks are often brittle when deployed in real-world systems. Consequently, recent research has focused on testing the robustness of such models, resulting in a diverse set of evaluation methodologies ranging from adversarial attacks to rule-based data transformations. In this work, we identify challenges with evaluating NLP systems and propose a solution in the form of Robustness Gym (RG), a simple and extensible evaluation toolkit that unifies four standard evaluation paradigms: subpopulations, transformations, evaluation sets, and adversarial attacks. By providing a common platform for evaluation, Robustness Gym enables practitioners to compare results from all four evaluation paradigms with just a few clicks, and to easily develop and share novel evaluation methods using a built-in set of abstractions. To validate Robustness Gym's utility to practitioners, we conduct a real-world case study with a sentiment-modeling team, revealing performance degradations of 18%+. To verify that Robustness Gym can aid novel research analyses, we perform the first study of state-of-the-art commercial and academic named entity linking (NEL) systems, as well as a fine-grained analysis of state-of-the-art summarization models. For NEL, commercial systems struggle to link rare entities and lag their academic counterparts by 10%+, while state-of-the-art summarization models struggle on examples that require abstraction and distillation, degrading by 9%+. Robustness Gym can be found at https://robustnessgym.com/
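To give a concrete sense of the unification the abstract describes, the following is a minimal, hypothetical sketch (not the actual Robustness Gym API) in which evaluation paradigms such as subpopulations and transformations are both framed as "slice builders": functions that map a dataset to named evaluation slices, so that one scoring routine can compare a model across all of them. The data, model, and function names are illustrative assumptions.

```python
# Hypothetical sketch of the "unified evaluation" idea: subpopulations and
# transformations are both expressed as slice builders over a dataset.
# None of these names come from the Robustness Gym library itself.

def subpopulation(data, name, predicate):
    """Subpopulation: keep only examples satisfying a predicate."""
    return {name: [ex for ex in data if predicate(ex)]}

def transformation(data, name, fn):
    """Transformation: apply a rule-based perturbation to every example."""
    return {name: [{**ex, "text": fn(ex["text"])} for ex in data]}

def evaluate(model, slices):
    """Score a model on each named slice; here, toy classification accuracy."""
    return {
        name: sum(model(ex["text"]) == ex["label"] for ex in sl) / max(len(sl), 1)
        for name, sl in slices.items()
    }

# Toy sentiment data and a deliberately brittle "model" keyed on one token,
# to illustrate how slice-level scores expose robustness gaps.
data = [
    {"text": "Great movie", "label": "pos"},
    {"text": "terrible plot", "label": "neg"},
    {"text": "GREAT fun", "label": "pos"},
]
model = lambda text: "pos" if "Great" in text else "neg"

slices = {}
slices.update(subpopulation(data, "short", lambda ex: len(ex["text"]) < 12))
slices.update(transformation(data, "lowercased", str.lower))
print(evaluate(model, slices))  # per-slice accuracy, one number per paradigm
```

Because every paradigm yields slices of the same shape, the comparison the abstract promises (one table of results across subpopulations, transformations, evaluation sets, and attacks) falls out of a single `evaluate` call; here the lowercasing transformation drops the brittle model's accuracy, mirroring the degradations reported in the case studies.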