Statistical significance testing centered on p-values is commonly used to compare NLP system performance, but p-values alone are insufficient because statistical significance differs from practical significance. The latter can be measured by estimating effect size. In this paper, we propose a three-stage procedure for comparing NLP system performance and provide a toolkit, NLPStatTest, that automates the process. Users can upload NLP system evaluation scores, and the toolkit will analyze these scores, run appropriate significance tests, estimate effect size, and conduct power analysis to estimate Type II error. The toolkit provides a convenient and systematic way to compare NLP system performance that goes beyond statistical significance testing.
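To make the steps listed in the abstract concrete, the following is a minimal Python sketch of a significance test, an effect size estimate, and a power estimate on paired per-item scores. It is not the NLPStatTest API: the function names, the choice of a sign-flip permutation test, Cohen's d on paired differences, and the normal-approximation power formula are all assumptions made for illustration, and the toolkit's preliminary data analysis step is omitted.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def sign_flip_permutation_test(diffs, n_resamples=10_000, seed=0):
    """Two-sided paired permutation test on per-item score differences.

    Hypothetical helper: under the null hypothesis each difference is
    equally likely to be positive or negative, so signs are flipped at random.
    """
    rng = random.Random(seed)
    observed = abs(math.fsum(diffs))
    hits = 0
    for _ in range(n_resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(math.fsum(flipped)) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimated p-value strictly above zero.
    return (hits + 1) / (n_resamples + 1)


def cohens_d_paired(diffs):
    """Standardized mean difference (assumed effect size measure).

    Raises ZeroDivisionError if all differences are identical.
    """
    return mean(diffs) / stdev(diffs)


def approx_power(effect_size, n, alpha=0.05):
    """Normal-approximation power of a two-sided paired test (an assumption)."""
    normal = NormalDist()
    z_crit = normal.inv_cdf(1 - alpha / 2)
    shift = effect_size * math.sqrt(n)
    return normal.cdf(shift - z_crit) + normal.cdf(-shift - z_crit)


def compare_systems(scores_a, scores_b, alpha=0.05):
    """Run the three estimates on paired scores from two systems."""
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired item by item")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    d = cohens_d_paired(diffs)
    power = approx_power(d, len(diffs), alpha)
    return {
        "p_value": sign_flip_permutation_test(diffs),
        "effect_size": d,
        "power": power,
        "type_ii_error": 1 - power,  # Type II error rate is 1 - power
    }
```

Calling `compare_systems(scores_a, scores_b)` with two equal-length lists of per-item scores returns the p-value together with the effect size and the estimated Type II error, so a small p-value can be read alongside how large the difference actually is.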