Information retrieval is an important component of natural language processing systems for knowledge-intensive tasks such as question answering and fact checking. Recently, dense retrievers based on neural networks have emerged as an alternative to classical sparse methods based on term frequency. These models have obtained state-of-the-art results on datasets and tasks where large training sets are available. However, they do not transfer well to new domains or applications with no training data, and are often outperformed by unsupervised term-frequency methods such as BM25. A natural question is therefore whether it is possible to train dense retrievers without supervision. In this work, we explore the limits of contrastive learning as a way to train unsupervised dense retrievers, and show that it leads to strong retrieval performance. More precisely, we show on the BEIR benchmark that our model outperforms BM25 on 11 out of 15 datasets. Furthermore, when a few thousand examples are available, fine-tuning our model on them leads to strong improvements over BM25. Finally, when used as pre-training before fine-tuning on the MS-MARCO dataset, our technique obtains state-of-the-art results on the BEIR benchmark.
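As a rough illustration of the contrastive objective referred to above, the sketch below shows an InfoNCE-style loss with in-batch negatives for a bi-encoder retriever; the temperature value, the mean-pooling choice, and the assumption that positive pairs come from paired text spans of the same document are illustrative assumptions, not necessarily the exact recipe used in this work.

```python
import torch
import torch.nn.functional as F

def mean_pool(hidden_states, attention_mask):
    # Average token embeddings while ignoring padding positions
    # (a common pooling choice for dense retrievers).
    mask = attention_mask.unsqueeze(-1).float()
    return (hidden_states * mask).sum(1) / mask.sum(1).clamp(min=1e-6)

def info_nce_loss(query_emb, passage_emb, temperature=0.05):
    """Contrastive (InfoNCE) loss with in-batch negatives.

    query_emb:   (B, d) embeddings of queries, e.g. one span per document
    passage_emb: (B, d) embeddings of the paired positive passages
    For each query, the other B-1 passages in the batch act as negatives.
    """
    # (B, B) similarity matrix; the diagonal holds the positive pairs.
    scores = query_emb @ passage_emb.T / temperature
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)
```

The same loss can in principle be reused for the supervised fine-tuning stages mentioned above, simply by feeding annotated query and passage pairs instead of unsupervised ones.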