The development of modern NLP applications often relies on various benchmark datasets containing large numbers of manually labeled tests to evaluate performance. Constructing such datasets consumes substantial resources, yet performance on the held-out data may not properly reflect a model's capability in real-world application scenarios, which can cause serious misunderstanding and monetary loss. To alleviate this problem, in this paper, we propose an automated test generation method for detecting erroneous behaviors of various NLP applications. Our method is designed based on the sentence parsing process of classic linguistics, so it can assemble basic grammatical elements and adjuncts into grammatically correct tests with proper oracle information. We implement this method in NLPLego, which is designed to fully exploit the potential of seed sentences to automate test generation. NLPLego disassembles a seed sentence into a template and adjuncts and then generates new sentences by assembling context-appropriate adjuncts with the template in a specific order. Unlike task-specific methods, the tests generated by NLPLego have derivation relations and different degrees of variation, which makes constructing appropriate metamorphic relations easier. NLPLego is therefore general, meaning it can meet the testing requirements of various NLP applications. To validate NLPLego, we experiment with three common NLP tasks, identifying failures in four state-of-the-art models. Given seed tests from SQuAD 2.0, SST, and QQP, NLPLego successfully detects 1,732, 5,301, and 261,879 incorrect behaviors in the three tasks, respectively, with around 95.7% precision.
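To make the high-level workflow described above concrete (disassembling a seed sentence into a template plus adjuncts, then re-attaching the adjuncts one at a time to obtain derivationally related tests), the following is a minimal Python sketch. All identifiers (`disassemble`, `assemble`, `Adjunct`, the slot representation) are hypothetical illustrations rather than NLPLego's actual API, and the parsing step is stubbed out; a real implementation would rely on a linguistic parse of the seed sentence.

```python
# Hypothetical sketch of the template/adjunct disassemble-assemble idea.
# Names and data structures are illustrative only, not NLPLego's implementation.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Adjunct:
    text: str   # the adjunct phrase, e.g. "in Paris"
    slot: int   # word position in the template where it re-attaches


def disassemble(seed: str) -> Tuple[str, List[Adjunct]]:
    """Split a seed sentence into a core template and its adjuncts.

    Stub: a real implementation would derive the template and adjuncts
    from a constituency/dependency parse of `seed`.
    """
    template = "The team won the cup."
    adjuncts = [Adjunct("founded in 1901", slot=2),
                Adjunct("in Paris", slot=5)]
    return template, adjuncts


def assemble(template: str, adjuncts: List[Adjunct]) -> List[str]:
    """Re-attach adjuncts one at a time, yielding derivationally related tests.

    Each generated sentence differs from its predecessor by exactly one
    adjunct, which is what makes metamorphic relations easy to state
    (e.g. the model's prediction should stay consistent across the chain).
    """
    tests = [template]
    words = template.rstrip(".").split()
    # Insert at later slots first so earlier slot indices stay valid.
    for adj in sorted(adjuncts, key=lambda a: a.slot, reverse=True):
        words = words[:adj.slot] + adj.text.split() + words[adj.slot:]
        tests.append(" ".join(words) + ".")
    return tests


if __name__ == "__main__":
    tmpl, adjs = disassemble("The team, founded in 1901, won the cup in Paris.")
    for t in assemble(tmpl, adjs):
        print(t)
```

Running this sketch prints a chain of sentences, each extending the previous one by a single adjunct, which illustrates why the generated tests carry derivation relations and graded degrees of variation.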