Recently, it was shown that most popular IR measures are not interval-scaled, implying that decades of experimental IR research may have used improper methods and thus produced questionable results. However, it was unclear whether and to what extent these findings apply to actual evaluations, which opened a debate in the community, with researchers taking opposing positions on whether this should be considered an issue and, if so, how serious it is. In this paper, we first give an introduction to the representational theory of measurement, explaining why certain operations and significance tests are permissible only with scales of a certain level. To this end, we introduce the notion of meaningfulness, which specifies the conditions under which the truth (or falsity) of a statement is invariant under permissible transformations of a scale. Furthermore, we show how the recall base and the length of the run may make comparison and aggregation across topics problematic. We then propose a straightforward and powerful approach for turning an evaluation measure into an interval scale, and describe an experimental evaluation of the differences between using the original measures and the interval-scaled ones. For all the measures considered, namely Precision, Recall, Average Precision, (Normalized) Discounted Cumulative Gain, Rank-Biased Precision, and Reciprocal Rank, we observe substantial effects, both on the order of average values and on the outcome of significance tests. For the latter, some previously significant differences turn out to be non-significant, while some previously non-significant ones become significant. The effect varies remarkably between the tests considered but, on average, we observed a 25% change in the decision about which systems are significantly different and which are not.
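The notion of meaningfulness can be illustrated with a minimal sketch. It is not taken from the paper's experiments: the per-topic scores and the helper `order_of_means` are hypothetical, chosen only to show that the statement "system A has a higher mean score than system B" keeps its truth value under an affine transformation (permissible for an interval scale) but can flip under a merely order-preserving one (permissible for an ordinal scale).

```python
import math
from statistics import mean

# Hypothetical per-topic scores of two systems (illustrative values only).
system_a = [0.1, 0.9]
system_b = [0.4, 0.5]


def order_of_means(a, b, transform=lambda x: x):
    """Compare the mean of the transformed scores of two systems."""
    mean_a = mean(transform(x) for x in a)
    mean_b = mean(transform(x) for x in b)
    if mean_a > mean_b:
        return "A>B"
    if mean_a < mean_b:
        return "A<B"
    return "A=B"


# Original scores: mean(A) = 0.5, mean(B) = 0.45.
print(order_of_means(system_a, system_b))

# Affine transformation (positive slope): the order of means is preserved.
print(order_of_means(system_a, system_b, lambda x: 2 * x + 3))

# Strictly increasing but non-affine transformation: the per-topic order of
# scores is preserved, yet the order of the means is reversed.
print(order_of_means(system_a, system_b, math.sqrt))
```

If the scores are only ordinal, the square root is as legitimate a rescaling as the identity, so a statement about the order of means is not meaningful on such a scale.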