Pre-trained Language Models have recently emerged in Information Retrieval as the backbone of a new generation of neural systems that outperform traditional methods on a variety of tasks. However, it is still unclear to what extent such approaches generalize in zero-shot conditions. The recent BEIR benchmark provides partial answers to this question by comparing models on datasets and tasks that differ from the training conditions. We aim to address the same question by comparing models under more explicit distribution shifts. To this end, we build three query-based distribution shifts within MS MARCO (query-semantic, query-intent, query-length), which are used to evaluate the three main families of BERT-based neural retrievers (sparse, dense, and late-interaction), as well as a monoBERT re-ranker. We further analyse the performance drops between the train and test query distributions. In particular, we experiment with two generalization indicators: the first based on train/test query vocabulary overlap, and the second based on the representations of a trained bi-encoder. Intuitively, these indicators confirm that the further the test distribution is from the train one, the larger the drop in performance. We also show that the models respond differently to the shifts, with dense approaches being the most affected. Overall, our study demonstrates that it is possible to design more controllable distribution shifts as a tool to better understand the generalization of IR models. Finally, we release the MS MARCO query subsets, which provide an additional resource for benchmarking zero-shot transfer in Information Retrieval.
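To make the two generalization indicators concrete, below is a minimal Python sketch. The abstract does not specify the exact formulations, so this assumes the vocabulary indicator is a Jaccard overlap over query token sets and the representation indicator is the mean nearest-neighbour cosine similarity between test and train query embeddings from a bi-encoder; the sentence-transformers checkpoint name is likewise an assumption for illustration, not the paper's model.

```python
# Hypothetical sketch of the two generalization indicators described above.
# Assumptions (not from the paper): Jaccard overlap over whitespace-tokenized
# query vocabularies, and a nearest-neighbour cosine-similarity proxy computed
# with an off-the-shelf MS MARCO bi-encoder checkpoint.
import numpy as np
from sentence_transformers import SentenceTransformer


def vocab_overlap(train_queries, test_queries):
    """Jaccard overlap between train and test query vocabularies."""
    train_vocab = {t for q in train_queries for t in q.lower().split()}
    test_vocab = {t for q in test_queries for t in q.lower().split()}
    return len(train_vocab & test_vocab) / len(train_vocab | test_vocab)


def representation_proximity(
    train_queries,
    test_queries,
    model_name="sentence-transformers/msmarco-distilbert-base-v4",  # assumed checkpoint
):
    """Mean cosine similarity of each test query to its nearest train query."""
    model = SentenceTransformer(model_name)
    train_emb = model.encode(train_queries, normalize_embeddings=True)
    test_emb = model.encode(test_queries, normalize_embeddings=True)
    sims = test_emb @ train_emb.T  # cosine similarities (embeddings are normalized)
    return float(np.mean(sims.max(axis=1)))  # nearest-neighbour similarity per test query


if __name__ == "__main__":
    train = ["what is a neural network", "how do vaccines work"]
    test = ["definition of neural retrieval", "symptoms of flu"]
    print("vocabulary overlap:", vocab_overlap(train, test))
    print("representation proximity:", representation_proximity(train, test))
```

Under this reading, lower values of either quantity would indicate a test query distribution further from the training one, which the abstract associates with larger performance drops.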