Statistical Significance Testing in Information Retrieval: An Empirical Analysis of Type I, Type II and Type III Errors

Statistical significance testing is widely accepted as a means to assess how well a difference in effectiveness reflects an actual difference between systems, as opposed to random noise due to the selection of topics. According to recent surveys of SIGIR, CIKM, ECIR and TOIS papers, the t-test is the most popular choice among IR researchers. However, previous work has suggested computer-intensive tests like the bootstrap or the permutation test, based mainly on theoretical arguments. On empirical grounds, others have suggested non-parametric alternatives such as the Wilcoxon test. Indeed, the question of which tests we should use has accompanied IR and related fields for decades now. Previous theoretical studies on this matter were limited because test assumptions are known not to hold in IR experiments, and empirical studies were limited because they lacked the control over the null hypotheses needed to compute actual Type I and Type II error rates under realistic conditions. Therefore, it is unclear not only which test to use, but also how much trust we should put in the results. In contrast to past studies, in this paper we employ a recent simulation methodology based on TREC data to get around these limitations. Our study comprises over 500 million p-values computed for a range of tests, systems, effectiveness measures, topic set sizes and effect sizes, and for both the two-tailed and one-tailed cases. Having such a large supply of IR evaluation data with full knowledge of the null hypotheses, we are finally in a position to evaluate how well statistical significance tests really behave with IR data, and to make sound recommendations for practitioners.
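
To make the idea of measuring a Type I error rate under a known null hypothesis concrete, the following is a minimal sketch and not the methodology used in the paper. It implements a paired sign-flip permutation test on per-topic score differences and then counts how often the test rejects when both systems are drawn from the same distribution. The function names (`paired_permutation_test`, `estimate_type_i_error`), the Gaussian toy scores, and all parameter values are assumptions made for illustration; the paper simulates from TREC data instead.

```python
import random


def paired_permutation_test(scores_a, scores_b, n_resamples=2000,
                            two_tailed=True, seed=0):
    """Sign-flip randomization test on paired per-topic scores.

    Returns an estimated p-value for the mean per-topic difference.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same topics")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs) / len(diffs)
    hits = 0
    for _ in range(n_resamples):
        # Under the null, the sign of each per-topic difference is arbitrary.
        resampled = sum(d if rng.random() < 0.5 else -d for d in diffs)
        resampled /= len(diffs)
        if two_tailed:
            hits += abs(resampled) >= abs(observed)
        else:
            hits += resampled >= observed
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)


def estimate_type_i_error(n_topics=25, n_trials=200, alpha=0.05, seed=1):
    """Fraction of rejections when the null is true by construction.

    Toy assumption: per-topic scores of both systems are Gaussian with
    the same mean and spread, which is NOT how the paper simulates data.
    """
    rng = random.Random(seed)
    false_positives = 0
    for trial in range(n_trials):
        a = [rng.gauss(0.3, 0.1) for _ in range(n_topics)]
        b = [rng.gauss(0.3, 0.1) for _ in range(n_topics)]
        p = paired_permutation_test(a, b, n_resamples=500, seed=trial)
        false_positives += p <= alpha
    return false_positives / n_trials


if __name__ == "__main__":
    rate = estimate_type_i_error()
    print(f"estimated Type I error rate at alpha=0.05: {rate:.3f}")
```

A well-behaved test should reject in roughly a fraction `alpha` of the trials in this setting. Estimating a Type II error rate would follow the same pattern, with the two systems drawn from distributions whose means differ by a chosen effect size.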
