Neural Information Retrieval models hold the promise of replacing lexical matching models, e.g. BM25, in modern search engines. While their capabilities have fully shone on in-domain datasets like MS MARCO, they have recently been challenged in out-of-domain, zero-shot settings (the BEIR benchmark), calling into question their actual generalization capabilities compared to bag-of-words approaches. In particular, we wonder whether these shortcomings could (partly) be a consequence of the inability of neural IR models to perform lexical matching off-the-shelf. In this work, we propose a measure of discrepancy between the lexical matching performed by any (neural) model and an 'ideal' one. Based on this, we study the behavior of different state-of-the-art neural IR models, focusing on whether they are able to perform lexical matching when it is actually useful, i.e. for important terms. Overall, we show that neural IR models fail to properly generalize term importance on out-of-domain collections or terms almost unseen during training.