In recent years, large pre-trained transformers have led to substantial gains in performance over traditional retrieval models and feedback approaches. However, these results are primarily based on the MS MARCO/TREC Deep Learning Track setup, with its very particular characteristics, and our understanding of why and how these models work better is fragmented at best. We analyze effective BERT-based cross-encoders against traditional BM25 ranking for the passage retrieval task, where the largest gains have been observed, and investigate two main questions. On the one hand, what is similar? To what extent does the neural ranker already encompass the capacity of traditional rankers? Is the gain in performance due to a better ranking of the same documents (prioritizing precision)? On the other hand, what is different? Can it effectively retrieve documents missed by traditional systems (prioritizing recall)? We discover substantial differences in the notion of relevance, identifying strengths and weaknesses of BERT that may inspire research for future improvement. Our results contribute to our understanding of (black-box) neural rankers relative to (well-understood) traditional rankers, and help explain the particular experimental setting of MS MARCO-based test collections.
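To make the comparison concrete, the following is a minimal sketch of scoring the same passages with BM25 and a BERT cross-encoder, assuming the rank_bm25 and sentence-transformers libraries; the toy query and passages are hypothetical, and the model name is an off-the-shelf MS MARCO cross-encoder, not necessarily the exact model analyzed here.

```python
# Illustrative only: rank a toy corpus with BM25 and a BERT cross-encoder,
# then compare the two induced orderings. Disagreements between them are
# where the two notions of relevance diverge.
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

query = "what causes tides"  # hypothetical query
passages = [  # hypothetical mini-corpus
    "Tides are caused by the gravitational pull of the moon and the sun.",
    "The ocean is home to many species of fish and marine mammals.",
    "Gravitational forces from the moon create bulges in Earth's oceans.",
]

# Traditional lexical ranking: BM25 over whitespace-tokenized passages.
bm25 = BM25Okapi([p.lower().split() for p in passages])
bm25_scores = bm25.get_scores(query.lower().split())

# Neural ranking: the cross-encoder scores each (query, passage) pair jointly.
ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
ce_scores = ce.predict([(query, p) for p in passages])

for name, scores in [("BM25", bm25_scores), ("cross-encoder", ce_scores)]:
    order = sorted(range(len(passages)), key=lambda i: -scores[i])
    print(name, "ranking (passage indices, best first):", order)
```

In this setup, ranking overlap between the two systems speaks to the precision question (re-ordering the same documents), while documents the cross-encoder surfaces that BM25 scores near zero speak to the recall question.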