Research community evaluations in information retrieval, such as NIST's Text REtrieval Conference (TREC), build reusable test collections by pooling the document rankings submitted by many teams. The quality of the resulting test collection thus depends greatly on the number of participating teams and on the quality of their submitted runs. In this work, we investigate: i) how the number of participants, coupled with other factors, affects the quality of a test collection; and ii) whether the quality of a test collection can be inferred before collecting relevance judgments from human assessors. Experiments on six TREC collections illustrate how the number of teams interacts with various other factors to influence the quality of the resulting test collections. We also show that the reusability of a test collection can be predicted with high accuracy when the same document collection is used across successive years of an evaluation campaign, as is common in TREC.
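To make the pooling step concrete, the sketch below shows standard depth-k pooling, the approach traditionally used at TREC: for each topic, the top k documents of every submitted run are unioned into a pool, and only pooled documents are judged by human assessors. The function name, data layout, and the depth k are illustrative assumptions for this sketch, not specifics taken from the paper.

```python
from collections import defaultdict

def depth_k_pool(runs, k=100):
    """Depth-k pooling sketch: union the top-k documents of every
    run, per topic. `runs` maps run_id -> {topic_id -> ranked doc list}.
    (Names and the default depth k=100 are illustrative assumptions.)
    """
    pool = defaultdict(set)  # topic_id -> set of pooled doc_ids
    for ranking_by_topic in runs.values():
        for topic, ranked_docs in ranking_by_topic.items():
            pool[topic].update(ranked_docs[:k])
    return pool

# Toy usage: two runs on one topic; the pool is the union of their top-2.
runs = {
    "teamA.run1": {"301": ["d3", "d7", "d1"]},
    "teamB.run1": {"301": ["d7", "d9", "d2"]},
}
print(dict(depth_k_pool(runs, k=2)))  # {'301': {'d3', 'd7', 'd9'}} (set order may vary)
```

The sketch makes the paper's premise visible: with fewer participating teams (fewer entries in `runs`), the pool covers less of the relevant-document space, which is exactly the dependence on participation that the experiments examine.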