A query performance predictor (QPP) estimates the retrieval effectiveness of an IR system for a given query. An important characteristic of QPP evaluation is that the ground-truth retrieval effectiveness can itself be measured with different metrics, so the ground truth is not absolute, in contrast to other retrieval tasks such as ad-hoc retrieval. Motivated by this observation, the objective of this paper is to investigate how such variance in the ground truth for QPP evaluation can affect the outcomes of QPP experiments. We consider this not only in terms of the absolute values of the evaluation metrics being reported (e.g. Pearson's $r$, Kendall's $\tau$), but also with respect to changes in the ranks of different QPP systems when ordered by their QPP metric scores. Our experiments reveal that the observed QPP outcomes can vary considerably, both in the absolute evaluation metric values and in the relative system ranks. Through our analysis, we report the combinations of QPP evaluation metric and experimental settings that are likely to lead to smaller variations in the observed results.
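As a minimal sketch of the evaluation setup implied above (the notation $\phi$, $M$, and $Q$ is ours, not taken from the paper), a QPP method $\phi$ is typically scored by correlating its per-query predictions with per-query ground-truth effectiveness computed under a chosen target metric $M$ (e.g. AP or nDCG) over a query set $Q$, for instance via Pearson's $r$:
\[
  r_M(\phi) \;=\; \frac{\sum_{q \in Q}\bigl(\phi(q) - \bar{\phi}\bigr)\bigl(M(q) - \bar{M}\bigr)}
                       {\sqrt{\sum_{q \in Q}\bigl(\phi(q) - \bar{\phi}\bigr)^{2}}\,
                        \sqrt{\sum_{q \in Q}\bigl(M(q) - \bar{M}\bigr)^{2}}}
\]
Replacing $M$ (or its rank cut-off) changes the ground-truth values on which the correlation is computed, which is precisely the source of variance in QPP outcomes that this paper investigates.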