Offline evaluation of information retrieval and recommendation has traditionally focused on distilling the quality of a ranking into a scalar metric such as average precision or normalized discounted cumulative gain. Such a metric can then be used to compare the performance of multiple systems on the same request. Although evaluation metrics provide a convenient summary of system performance, they also collapse subtle differences across users into a single number and can carry assumptions about user behavior and utility that are not supported across all retrieval scenarios. We propose recall-paired preference (RPP), a metric-free evaluation method based on directly computing a preference between ranked lists. RPP simulates multiple user subpopulations per query and compares systems across these pseudo-populations. Our results across multiple search and recommendation tasks demonstrate that RPP substantially improves discriminative power while correlating well with existing metrics and remaining equally robust to incomplete data.
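To make the comparison concrete, below is a minimal Python sketch of a per-query recall-paired preference under one reading of the abstract: each recall level k stands in for a user subpopulation satisfied once k relevant documents have been seen, and the system that reaches that level at a shallower rank is preferred by that subpopulation. The function names (`recall_level_ranks`, `rpp`), the padding rule for unretrieved relevant documents, and the mean-of-signs aggregation are illustrative assumptions, not the authors' reference implementation.

```python
from typing import Sequence, Set, List


def recall_level_ranks(ranking: Sequence[str], relevant: Set[str]) -> List[int]:
    """For each recall level k = 1..R, return the (1-based) rank at which
    the ranking first retrieves k relevant documents.

    Relevant documents missing from the ranking are treated as retrieved
    just past the end of the list (an assumed padding rule)."""
    ranks = sorted(i + 1 for i, doc in enumerate(ranking) if doc in relevant)
    worst = len(ranking) + 1
    ranks += [worst] * (len(relevant) - len(ranks))
    return ranks


def rpp(ranking_a: Sequence[str], ranking_b: Sequence[str],
        relevant: Set[str]) -> float:
    """Per-query preference of system A over system B: at each recall level,
    the system reaching that level at a shallower rank wins that simulated
    subpopulation. Returns the mean of per-level signs in [-1, 1];
    positive values favor A, negative favor B, 0 is a tie."""
    ranks_a = recall_level_ranks(ranking_a, relevant)
    ranks_b = recall_level_ranks(ranking_b, relevant)
    # +1 where A is shallower, -1 where B is shallower, 0 on ties.
    signs = [(rb > ra) - (rb < ra) for ra, rb in zip(ranks_a, ranks_b)]
    return sum(signs) / len(signs) if signs else 0.0
```

For example, with `relevant = {"d1", "d2", "d3"}`, calling `rpp(["d1", "d2", "x", "y", "d3"], ["x", "d1", "d2", "d3"], relevant)` returns 1/3: the first system wins at recall levels 1 and 2 but loses at level 3, illustrating how the simulated subpopulations can disagree and how the preference aggregates those disagreements instead of collapsing them into a single utility model.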