The recently proposed capability-based NLP testing allows model developers to test the functional capabilities of NLP models, revealing functional failures that the traditional held-out evaluation mechanism cannot detect. However, existing work on capability-based testing requires extensive manual effort and domain expertise to create the test cases. In this paper, we investigate a low-cost approach to test case generation that leverages the GPT-3 engine. We further propose using a classifier to remove invalid outputs from GPT-3 and expanding the remaining outputs into templates to generate more test cases. Our experiments show that this approach, TestAug, has three advantages over existing work on behavioral testing: (1) TestAug finds more bugs; (2) the test cases in TestAug are more diverse; and (3) TestAug substantially reduces the manual effort needed to create test suites. The code and data for TestAug can be found at our project website (https://guanqun-yang.github.io/testaug/) and on GitHub (https://github.com/guanqun-yang/testaug).
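The filter-then-expand idea described above can be illustrated with a minimal sketch. This is not the TestAug implementation: the GPT-3 call is replaced by a hard-coded list of candidate sentences, the validity classifier is a toy stand-in, and the names `filter_valid`, `to_template`, `expand`, `augment`, `toy_is_valid`, and the lexicon slots `pos_adj` and `neg_adj` are all hypothetical, introduced only for illustration.

```python
from itertools import product
from typing import Callable, Dict, Iterable, List


def filter_valid(candidates: Iterable[str], is_valid: Callable[[str], bool]) -> List[str]:
    """Keep only the candidates that the validity check accepts."""
    return [c for c in candidates if is_valid(c)]


def to_template(sentence: str, lexicon: Dict[str, List[str]]) -> str:
    """Replace each whitespace-separated token found in the lexicon with a {slot} placeholder."""
    tokens = []
    for token in sentence.split():
        slot = next((name for name, words in lexicon.items() if token in words), None)
        tokens.append("{" + slot + "}" if slot else token)
    return " ".join(tokens)


def expand(template: str, lexicon: Dict[str, List[str]]) -> List[str]:
    """Fill the slots present in the template with every combination of lexicon words."""
    slots = [name for name in lexicon if "{" + name + "}" in template]
    return [
        template.format(**dict(zip(slots, combo)))
        for combo in product(*(lexicon[s] for s in slots))
    ]


def augment(candidates, is_valid, lexicon) -> List[str]:
    """Filter candidates, turn each into a template, expand it, and drop duplicates."""
    cases: List[str] = []
    for sentence in filter_valid(candidates, is_valid):
        for case in expand(to_template(sentence, lexicon), lexicon):
            if case not in cases:
                cases.append(case)
    return cases


def toy_is_valid(sentence: str) -> bool:
    """Assumption: a trivial length check standing in for a trained classifier."""
    return len(sentence.split()) >= 4


# Assumption: hard-coded sentences standing in for language-model outputs.
candidates = ["The food was great", "The service was terrible", "great great banana"]
lexicon = {"pos_adj": ["great", "excellent"], "neg_adj": ["terrible", "awful"]}

test_cases = augment(candidates, toy_is_valid, lexicon)
print(test_cases)
```

In this toy run, the malformed third candidate is discarded by the validity check, and each of the two remaining sentences yields two test cases through its template. A real pipeline would replace `toy_is_valid` with a trained classifier and would need a more careful templating step than whitespace token matching.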