Batch Evaluation Metrics in Information Retrieval: Measures, Scales, and Meaning

A sequence of recent papers has considered the role of measurement scales in information retrieval (IR) experimentation, and presented the argument that (only) uniform-step interval scales should be used, and hence that well-known metrics such as reciprocal rank, expected reciprocal rank, normalized discounted cumulative gain, and average precision should be either discarded as measurement tools, or adapted so that their metric values lie at uniformly spaced points on the number line. These papers paint a rather bleak picture of past decades of IR evaluation, at odds with the community's overall emphasis on practical experimentation and measurable improvement. Our purpose in this work is to challenge that position. In particular, we argue that mappings from categorical and ordinal data to sets of points on the number line are valid provided there is an external reason for each target point to have been selected. We first consider the general role of measurement scales, and of categorical, ordinal, interval, ratio, and absolute data collections. In connection with the first two of those categories we also provide examples of the knowledge that is captured and represented by numeric mappings to the real number line. Focusing then on information retrieval, we argue that document rankings are categorical data, and that the role of an effectiveness metric is to provide a single value that represents the usefulness to a user or population of users of any given ranking, with usefulness able to be represented as a continuous variable on a ratio scale. That is, we argue that current IR metrics are well-founded, and, moreover, that those metrics are more meaningful in their current form than in the proposed "intervalized" versions.
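To make concrete what "not uniformly spaced" means for the metrics named above, the following is a minimal sketch, not taken from the paper. It assumes binary relevance labels, a ranking depth of 4, and the textbook definitions of reciprocal rank and average precision; the function names and the choice to normalize average precision by the number of relevant documents found in the ranking are illustrative assumptions.

```python
from fractions import Fraction


def reciprocal_rank(rels):
    """Return 1/rank of the first relevant item, or 0 if none is relevant.

    `rels` is a list of 0/1 relevance labels in rank order (an assumption
    made for this sketch).
    """
    for rank, rel in enumerate(rels, start=1):
        if rel:
            return Fraction(1, rank)
    return Fraction(0)


def average_precision(rels):
    """Return the mean of precision@k over the ranks k holding relevant items.

    Assumption: every relevant document appears in the ranking, so the
    normalizer is the number of relevant items found.
    """
    hits = 0
    total = Fraction(0)
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += Fraction(hits, rank)
    return total / hits if hits else Fraction(0)


# Every reciprocal rank value attainable at depth 4: the first relevant
# item sits at rank 1, 2, 3, or 4, or there is no relevant item at all.
rankings = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 0]]
rr_values = sorted({reciprocal_rank(r) for r in rankings})

# Gaps between adjacent attainable values; on a uniform-step scale these
# would all be equal.
gaps = [b - a for a, b in zip(rr_values, rr_values[1:])]

print("RR values:", [str(v) for v in rr_values])
print("gaps:", [str(g) for g in gaps])
print("AP of [1, 0, 1, 0]:", average_precision([1, 0, 1, 0]))
```

Under these assumptions the attainable reciprocal rank values are 0, 1/4, 1/3, 1/2, and 1, with unequal gaps between neighbours. That unequal spacing is the property the criticized metrics share, and which the abstract argues is acceptable when each target value has an external justification.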
