Neural IR models have often been studied in homogeneous and narrow settings, which has considerably limited insights into their generalization capabilities. To address this, and to allow researchers to more broadly establish the effectiveness of their models, we introduce BEIR (Benchmarking IR), a heterogeneous benchmark for information retrieval. We leverage a careful selection of 17 datasets for evaluation, spanning diverse retrieval tasks from open-domain settings to narrow expert domains. We study the effectiveness of nine state-of-the-art retrieval models in a zero-shot evaluation setup on BEIR, finding that performing consistently well across all datasets is challenging. Our results show that BM25 is a robust baseline and that reranking-based models achieve the best overall zero-shot performance, albeit at high computational cost. In contrast, dense-retrieval models are computationally more efficient but often underperform other approaches, highlighting the considerable room for improvement in their generalization capabilities. In this work, we extensively analyze different retrieval models and provide several suggestions that we believe may be useful for future work. BEIR datasets and code are available at https://github.com/UKPLab/beir.
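As a minimal sketch of what such a zero-shot evaluation could look like with the `beir` Python package from the repository above: the example below downloads one benchmark dataset, retrieves with an off-the-shelf dense model, and reports standard IR metrics. The dataset URL, dataset name (`scifact`), and model name (`msmarco-distilbert-base-tas-b`) follow the repository's quick-start example and are assumptions here, not part of the abstract itself.

```python
# Sketch of a zero-shot dense-retrieval evaluation with the beir package
# (dataset URL, dataset name, and model name are illustrative assumptions).
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Download and unzip one BEIR dataset (here: SciFact) into ./datasets
dataset = "scifact"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "datasets")

# Load the corpus, queries, and relevance judgments for the test split
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Wrap a pretrained sentence encoder for exact (brute-force) dense retrieval
model = DRES(models.SentenceBERT("msmarco-distilbert-base-tas-b"), batch_size=16)
retriever = EvaluateRetrieval(model, score_function="dot")

# Retrieve documents for every query and score against the qrels
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)  # e.g. nDCG@1/3/5/10/100/1000 for the chosen dataset
```

Swapping in a different dataset name or retrieval model in this sketch is what makes the zero-shot comparison across the benchmark's heterogeneous tasks straightforward.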