Recent IR approaches based on Pretrained Language Models (PLMs) now largely outperform their predecessors on a variety of IR tasks. However, how learned word representations behave under distribution shifts remains unclear. Recently, the BEIR benchmark was introduced to assess the performance of neural rankers in zero-shot settings, and it revealed deficiencies in several models. As a complement to BEIR, we propose to \textit{explicitly} control distribution shifts. We selected different query subsets leading to different distribution shifts: short versus long queries, queries grouped by wh-word type, and 5 topic-based clusters. We then benchmarked state-of-the-art neural rankers such as the dense Bi-Encoder, SPLADE, and ColBERT under these different training and test conditions. Our study demonstrates that it is possible to design distribution shift experiments within the MSMARCO collection, and that the query subsets we selected constitute an additional benchmark to better study factors of generalization across models.
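For concreteness, the sketch below illustrates one way such query subsets could be derived from MSMARCO queries. It is a minimal illustration under assumed criteria, not the paper's actual selection code: the token-count threshold and wh-word list are illustrative choices, and the 5 topic-based clusters would require an additional clustering step (e.g., k-means over query representations) that is omitted here.
\begin{verbatim}
# Minimal sketch (assumed, not the authors' released code) of how
# MSMARCO queries could be partitioned into the subsets described above.
# The token-count threshold and wh-word list are illustrative choices.

WH_WORDS = ("what", "who", "when", "where", "why", "which", "how")
LENGTH_THRESHOLD = 6  # assumed token cutoff between "short" and "long"

def length_bucket(query: str) -> str:
    """Assign a query to the short or long subset by token count."""
    return "short" if len(query.split()) <= LENGTH_THRESHOLD else "long"

def wh_bucket(query: str) -> str:
    """Assign a query to a wh-word subset based on its first token."""
    tokens = query.lower().split()
    return tokens[0] if tokens and tokens[0] in WH_WORDS else "other"

if __name__ == "__main__":
    for q in ("what is the boiling point of water",
              "symptoms of vitamin d deficiency in older adults"):
        print(q, "->", length_bucket(q), "/", wh_bucket(q))
\end{verbatim}
A controlled shift experiment would then train a ranker on queries from one bucket and evaluate it on a disjoint bucket, matching the different training and test conditions described above.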