This paper argues for the widest possible use of bootstrap confidence intervals to compare the performance of NLP systems, in place of state-of-the-art (SOTA) status and statistical significance testing. Their main benefits are that they draw attention to the difference in performance between two systems and help assess the degree of superiority of one system over another. Two case studies, one comparing several systems and the other based on a K-fold cross-validation procedure, illustrate these benefits. A Python module for obtaining these confidence intervals, together with a second function implementing the Fisher-Pitman test for paired samples, is freely available on PyPI.
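To make the two techniques named above concrete, here is a minimal sketch of a percentile bootstrap confidence interval for the mean paired difference between two systems, along with a Monte Carlo version of the Fisher-Pitman (sign-flip permutation) test for paired samples. It is not the PyPI module mentioned in the abstract, whose name and API are not given here; the function names, parameters, and the per-item scores are all hypothetical and chosen only for illustration.

```python
import random
from statistics import mean


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-item difference (A - B).

    Hypothetical helper, not the API of the paper's module.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Resample the paired differences with replacement and record each mean.
    boot_means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    # Percentile method: cut off alpha/2 of the resampled means on each side.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return mean(diffs), boot_means[lo_idx], boot_means[hi_idx]


def paired_permutation_test(scores_a, scores_b, n_permutations=2000, seed=0):
    """Monte Carlo Fisher-Pitman test for paired samples (two-sided p-value).

    Under the null hypothesis the sign of each paired difference is
    exchangeable, so signs are flipped at random. Hypothetical helper.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(n_permutations):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        # Small tolerance so floating-point ties count as "at least as extreme".
        if abs(mean(flipped)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_permutations + 1)


# Made-up per-item scores (1 = correct, 0 = wrong) for two systems
# evaluated on the same 10 test items; illustration only.
sys_a = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
sys_b = [1, 0, 0, 1, 0, 1, 0, 1, 1, 0]

diff, low, high = paired_bootstrap_ci(sys_a, sys_b)
p_value = paired_permutation_test(sys_a, sys_b)
print(f"mean difference = {diff:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
print(f"Fisher-Pitman p-value (Monte Carlo) = {p_value:.3f}")
```

The interval reports how large the difference between the two systems plausibly is, whereas the p-value only indicates whether a difference of zero is compatible with the data, which is the contrast the abstract draws. The percentile method is only one of several ways to build a bootstrap interval, and the abstract does not say which variant the paper uses.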