Evaluation in Information Retrieval relies on post-hoc empirical procedures, which are time-consuming and expensive. To alleviate this, Query Performance Prediction (QPP) models have been developed to estimate the performance of a system without the need for human-made relevance judgements. Such models, which usually rely on lexical features of queries and corpora, have been applied to traditional sparse IR methods with varying degrees of success. With the advent of neural IR and large Pre-trained Language Models, the retrieval paradigm has significantly shifted towards more semantic signals. In this work, we study and analyze to what extent current QPP models can predict the performance of such systems. Our experiments consider seven traditional bag-of-words and seven BERT-based IR approaches, as well as nineteen state-of-the-art QPP models, evaluated on two collections, Deep Learning '19 and Robust '04. Our findings show that QPP models perform statistically significantly worse on neural IR systems. In settings where semantic signals are prominent (e.g., passage retrieval), their performance on neural models drops by as much as 10% compared to bag-of-words approaches. Furthermore, in lexically oriented scenarios, QPP models fail to predict the performance of neural IR systems on precisely those queries where they differ most from traditional approaches.
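To give a concrete sense of the kind of lexical, judgement-free signal the abstract refers to, here is a minimal sketch of a classic pre-retrieval QPP feature, the average IDF of the query terms. The whitespace tokenization, smoothed IDF formula, and toy corpus are simplifying assumptions for illustration only; they are not one of the nineteen predictors evaluated in the paper.

```python
# Minimal sketch of a pre-retrieval QPP signal: average IDF of query terms.
# Assumptions (not from the paper): whitespace tokenization, add-one smoothed
# IDF, and a tiny in-memory corpus instead of full collection statistics.
import math
from collections import Counter

def avg_idf_qpp(query: str, documents: list[str]) -> float:
    """Higher average IDF suggests more discriminative query terms,
    which is often correlated with easier (better-performing) queries."""
    n_docs = len(documents)
    doc_tokens = [set(doc.lower().split()) for doc in documents]
    df = Counter(term for tokens in doc_tokens for term in tokens)

    query_terms = query.lower().split()
    idfs = [
        math.log((n_docs + 1) / (df.get(term, 0) + 1))  # smoothed IDF
        for term in query_terms
    ]
    return sum(idfs) / len(idfs) if idfs else 0.0

# Toy usage example; a real predictor would use the full collection's statistics.
corpus = [
    "neural ranking models for passage retrieval",
    "bm25 is a classic bag of words retrieval model",
    "query performance prediction estimates retrieval effectiveness",
]
print(avg_idf_qpp("neural passage retrieval", corpus))
```

Because such features are purely lexical, they say nothing about the semantic matching that BERT-based rankers exploit, which is one intuition for the performance gap reported above.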