Joint Upper & Lower Bound Normalization for IR Evaluation

In this paper, we present a novel perspective on IR evaluation by proposing a new family of evaluation metrics in which existing popular metrics (e.g., nDCG, MAP) are customized by introducing a query-specific lower-bound (LB) normalization term. While the original metrics such as nDCG and MAP are normalized in terms of their upper bounds based on an ideal ranked list, a corresponding LB normalization for them has not yet been studied. Specifically, we introduce two different variants of the proposed LB normalization, where the lower bound is estimated from a randomized ranking of the corresponding documents present in the evaluation set. We then conduct two case studies by instantiating the new framework for two popular IR evaluation metrics (with two variants each, e.g., DCG_UL_V1,2 and MSP_UL_V1,2) and comparing them against the traditional metrics without the proposed LB normalization. Experiments on two different data-sets with eight Learning-to-Rank (LETOR) methods demonstrate the following properties of the new LB normalized metrics: 1) Differences between two methods that are statistically significant in terms of the original metric no longer remain statistically significant in terms of the Upper Lower (UL) bound normalized version, and vice versa, especially for uninformative query-sets. 2) When compared against the original metrics, our proposed UL normalized metrics demonstrate higher Discriminatory Power and better Consistency across different data-sets. These findings suggest that the IR community should consider UL normalization seriously when computing nDCG and MAP, and that a more in-depth study of UL normalization for general IR evaluation is warranted.
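
To make the idea of a query-specific lower bound concrete, the following is a minimal sketch, not the paper's definition. The abstract does not spell out either of its two variants, so the sketch assumes one plausible form: the lower bound is taken as the expected DCG of a uniformly random permutation of the query's documents, and the score is rescaled as (DCG - LB) / (IDCG - LB). The function names, the 2^rel - 1 gain, the log2 discount, and the handling of the zero-width case are all illustrative assumptions.

```python
import math


def dcg(relevances):
    """DCG of a ranked list of graded relevance labels (gain 2^rel - 1, log2 discount)."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))


def expected_random_dcg(relevances):
    """Expected DCG under a uniformly random permutation (assumed lower bound).

    By linearity of expectation, every position receives the mean gain.
    """
    n = len(relevances)
    if n == 0:
        return 0.0
    mean_gain = sum(2 ** rel - 1 for rel in relevances) / n
    return mean_gain * sum(1 / math.log2(rank + 2) for rank in range(n))


def ndcg(relevances):
    """Conventional nDCG: upper-bound normalization only."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def ul_normalized_dcg(relevances, eps=1e-12):
    """Hypothetical upper/lower-bound normalized DCG: (DCG - LB) / (IDCG - LB)."""
    ideal = dcg(sorted(relevances, reverse=True))
    lower = expected_random_dcg(relevances)
    span = ideal - lower
    if span <= eps:
        # Uninformative query: every ordering scores the same, so return 0
        # (an assumption made for this sketch).
        return 0.0
    return (dcg(relevances) - lower) / span


if __name__ == "__main__":
    for ranking in ([1, 0, 0], [0, 0, 1], [1, 1, 1]):
        print(ranking, round(ndcg(ranking), 4), round(ul_normalized_dcg(ranking), 4))
```

Under these assumptions, a query whose documents all share the same label gets an nDCG of 1 for any ranking but a rescaled score of 0, and a ranking worse than random gets a negative score. This is one way to see why differences on uninformative query-sets could behave differently once a lower bound is introduced.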
