The dramatic improvements in core information retrieval tasks engendered by neural rankers create a need for novel evaluation methods. If every ranker returns highly relevant items in the top ranks, it becomes difficult to recognize meaningful differences between them and to build reusable test collections. Several recent papers explore pairwise preference judgments as an alternative to traditional graded relevance assessments. Rather than viewing items one at a time, assessors view items side-by-side and indicate the one that provides the better response to a query, allowing fine-grained distinctions. If we employ preference judgments to identify the probably best items for each query, we can measure rankers by their ability to place these items as high as possible. We frame the problem of finding best items as a dueling bandits problem. While many papers explore dueling bandits for online ranker evaluation via interleaving, they have not been considered as a framework for offline evaluation via human preference judgments. We review the literature for possible solutions. For human preference judgments, any usable algorithm must tolerate ties, since two items may appear nearly equal to assessors, and it must minimize the number of judgments required for any specific pair, since each such comparison requires an independent assessor. Since the theoretical guarantees provided by most algorithms depend on assumptions that are not satisfied by human preference judgments, we simulate selected algorithms on representative test cases to provide insight into their practical utility. Based on these simulations, one algorithm stands out for its potential. Our simulations suggest modifications to further improve its performance. Using the modified algorithm, we collect over 10,000 preference judgments for submissions to the TREC 2021 Deep Learning Track, confirming its suitability.
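To make the offline dueling-bandits setting concrete, the sketch below simulates pairwise preference judgments over a small pool of items, allowing ties and capping the number of judgments per pair. It is only an illustration of the problem setup described above, not the paper's selected algorithm; the item qualities, tie margin, and per-pair cap are hypothetical values chosen for the example.

```python
# Minimal sketch (assumed parameters, not the paper's algorithm) of simulating
# human preference judgments as a dueling bandits problem: assessors may declare
# ties, and each pair receives at most a small number of independent judgments.
import random
from itertools import combinations
from collections import defaultdict

random.seed(0)

qualities = {"A": 0.9, "B": 0.85, "C": 0.5, "D": 0.2}  # latent item quality (assumed)
TIE_MARGIN = 0.05      # items closer than this look nearly equal to an assessor
MAX_PER_PAIR = 5       # each comparison needs a fresh assessor, so cap judgments per pair

def judge(i, j):
    """Simulate one assessor's preference between items i and j."""
    diff = qualities[i] - qualities[j]
    if abs(diff) < TIE_MARGIN:
        return None                      # tie: assessor cannot distinguish the items
    p_i = 0.5 + 0.5 * diff               # noisy preference toward the better item
    return i if random.random() < p_i else j

wins = defaultdict(int)
for i, j in combinations(qualities, 2):
    for _ in range(MAX_PER_PAIR):
        winner = judge(i, j)
        if winner is not None:
            wins[winner] += 1

# Items ordered by how often they were preferred; the front of this list
# approximates the "best items" a ranker should place at the top.
print(sorted(qualities, key=lambda x: wins[x], reverse=True))
```

In this toy setting the exhaustive round-robin wastes judgments on pairs whose outcome is already clear; the algorithms reviewed in the paper aim to identify the best items with far fewer comparisons.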