Recent literature has seen growing interest in using black-box strategies, such as CheckList, for testing the behavior of NLP models. Research on white-box testing has developed a number of methods for evaluating how thoroughly the internal behavior of deep models is tested, but these are not applicable to NLP models. We propose a set of white-box testing methods that are customized for transformer-based NLP models. These include Mask Neuron Coverage (MNCOVER), which measures how thoroughly the attention layers in a model are exercised during testing. We show that MNCOVER can refine test suites generated by CheckList, substantially reducing their size, by more than 60\% on average, while retaining failing tests, thereby concentrating the fault-detection power of the test suite. Further, we show how MNCOVER can be used to guide CheckList input generation, evaluate alternative NLP testing methods, and drive data augmentation to improve accuracy.
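To make the attention-coverage idea concrete, here is a minimal sketch of a threshold-based coverage metric over transformer attention. This is a simplified illustration, not the paper's exact MNCOVER computation: the model name, the 0.1 mask threshold, and the choice of (layer, head, query-position) coverage units are all assumptions made for the example.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Hypothetical, simplified coverage sketch: treat each (layer, head,
# query-position) triple as a coverage unit, and count it as exercised
# when some attention weight at that position survives a mask threshold.
def attention_mask_coverage(texts, model_name="bert-base-uncased",
                            threshold=0.1, max_len=64):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_attentions=True)
    model.eval()

    covered = None  # boolean tensor of shape (layers, heads, max_len)
    for text in texts:
        enc = tokenizer(text, return_tensors="pt",
                        truncation=True, max_length=max_len)
        with torch.no_grad():
            atts = model(**enc).attentions  # per layer: (1, heads, seq, seq)
        # A query position is "active" in a head if any of its attention
        # weights exceeds the mask threshold for this input.
        act = torch.stack([(a > threshold).any(-1).squeeze(0) for a in atts])
        pad = torch.zeros(act.shape[0], act.shape[1], max_len, dtype=torch.bool)
        pad[:, :, :act.shape[-1]] = act
        covered = pad if covered is None else covered | pad
    return covered.float().mean().item()  # fraction of units exercised

# Usage: coverage should grow as the test suite becomes more diverse,
# which is what makes such a metric usable for suite reduction.
print(attention_mask_coverage(["The movie was great.",
                               "I did not enjoy the plot at all."]))
```

Under this kind of metric, a test that activates no previously uncovered unit adds no coverage, which is the intuition behind pruning a CheckList suite while retaining its failing tests.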