Despite their recent popularity and well-known advantages, dense retrievers still lag behind sparse methods such as BM25 in their ability to reliably match salient phrases and rare entities in the query and to generalize to out-of-domain data. It has been argued that this is an inherent limitation of dense models. We rebut this claim by introducing the Salient Phrase Aware Retriever (SPAR), a dense retriever with the lexical matching capacity of a sparse model. We show that a dense Lexical Model {\Lambda} can be trained to imitate a sparse one, and SPAR is built by augmenting a standard dense retriever with {\Lambda}. Empirically, SPAR shows superior performance on a range of tasks including five question answering datasets, MS MARCO passage retrieval, as well as the EntityQuestions and BEIR benchmarks for out-of-domain evaluation, exceeding the performance of state-of-the-art dense and sparse retrievers. The code and models of SPAR are available at: https://github.com/facebookresearch/dpr-scale/tree/main/spar
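As a rough illustration of the idea in the abstract, the following is a minimal sketch (not the released dpr-scale code) of how a dense retriever could be augmented with a dense Lexical Model {\Lambda}: the query and passage embeddings of the two models are concatenated so that a single inner product yields the dense score plus a weighted lexical score. The encoder functions and the weight `mu` below are stand-ins for illustration only.

```python
# Hedged sketch, not the official SPAR implementation: combine a base dense
# retriever with a dense Lexical Model (Lambda) by concatenating embeddings,
# so one inner product = dense score + mu * lexical score.
import numpy as np

def encode_dense(text: str, dim: int = 8) -> np.ndarray:
    """Stand-in for the base dense retriever's encoder (hypothetical)."""
    rng = np.random.default_rng(abs(hash(("dense", text))) % (2**32))
    return rng.standard_normal(dim)

def encode_lexical(text: str, dim: int = 8) -> np.ndarray:
    """Stand-in for the Lexical Model Lambda trained to imitate a sparse model."""
    rng = np.random.default_rng(abs(hash(("lexical", text))) % (2**32))
    return rng.standard_normal(dim)

def spar_embed_query(query: str, mu: float = 1.0) -> np.ndarray:
    # The query side carries the weight mu on the Lambda component,
    # controlling how much lexical matching contributes to the final score.
    return np.concatenate([encode_dense(query), mu * encode_lexical(query)])

def spar_embed_passage(passage: str) -> np.ndarray:
    return np.concatenate([encode_dense(passage), encode_lexical(passage)])

q = spar_embed_query("who wrote the iliad", mu=0.7)
p = spar_embed_passage("The Iliad is an ancient Greek epic attributed to Homer.")
# A single dot product scores the pair, so standard dense-retrieval indexing
# (e.g. an inner-product index) can be reused without modification.
print(float(q @ p))
```

Because the combined representation is still a single fixed-size vector per query and per passage, retrieval with the augmented model can reuse the same indexing and search infrastructure as an ordinary dense retriever.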