Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-the-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
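To make the "software tool to generate test cases" concrete, here is a minimal sketch assuming the open-source `checklist` package (`pip install checklist`); the template, fill-in lists, and labels are illustrative, not the paper's exact tests.

```python
# Sketch: template-based generation of sentiment test cases with CheckList,
# assuming the open-source `checklist` package is installed.
from checklist.editor import Editor
from checklist.test_types import MFT

editor = Editor()

# Every combination of the fill-in lists becomes one concrete example;
# all of them are labeled negative (0) in this illustrative setup.
ret = editor.template(
    "I {negation} {pos_verb} the {thing}.",
    negation=["didn't", "can't say I"],
    pos_verb=["love", "like", "enjoy"],
    thing=["food", "service", "staff"],
    labels=0,
)

# Wrap the generated examples as a Minimum Functionality Test (MFT)
# targeting the "Negation" capability from the capability/test-type matrix.
test = MFT(
    ret.data,
    labels=ret.labels,
    name="Negated positive statements",
    capability="Negation",
)

print(len(ret.data), "generated cases, e.g.:", ret.data[0])
```

A test built this way can then be run against any model's prediction function and summarized to report its failure rate, which is how the paper surfaces the failures described above.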