Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-the-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
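To make the "software tool to generate test cases" concrete, here is a minimal sketch assuming the open-source `checklist` package (`pip install checklist`); the template, fill-in lists, and labels are illustrative, not the paper's exact tests.

```python
# Sketch: template-based generation of sentiment test cases with CheckList,
# assuming the open-source `checklist` package is installed.
from checklist.editor import Editor
from checklist.test_types import MFT

editor = Editor()

# Every combination of the fill-in lists becomes one concrete example;
# all of them are labeled negative (0) in this illustrative setup.
ret = editor.template(
    "I {negation} {pos_verb} the {thing}.",
    negation=["didn't", "can't say I"],
    pos_verb=["love", "like", "enjoy"],
    thing=["food", "service", "staff"],
    labels=0,
)

# Wrap the generated examples as a Minimum Functionality Test (MFT)
# targeting the "Negation" capability from the capability/test-type matrix.
test = MFT(
    ret.data,
    labels=ret.labels,
    name="Negated positive statements",
    capability="Negation",
)

print(len(ret.data), "generated cases, e.g.:", ret.data[0])
```

A test built this way can then be run against any model's prediction function and summarized to report its failure rate, which is how the paper surfaces the failures described above.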