Dense retrieval, which describes the use of contextualised language models such as BERT to identify documents from a collection by leveraging approximate nearest neighbour (ANN) techniques, has been increasing in popularity. Two families of approaches have emerged, depending on whether documents and queries are represented by single or multiple embeddings. ColBERT, the exemplar of the latter, uses an ANN index and approximate scores to identify a set of candidate documents for each query embedding, which are then re-ranked using accurate document representations. In this manner, a large number of documents can be retrieved for each query, hindering the efficiency of the approach. In this work, we investigate the use of ANN scores for ranking the candidate documents, in order to decrease the number of candidate documents being fully scored. Experiments conducted on the MSMARCO passage ranking corpus demonstrate that, by using the approximate scores to cut the candidate set down to only 200 documents, we can still obtain an effective ranking with no statistically significant difference in effectiveness, while achieving a 2x speedup in efficiency.
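The candidate-pruning idea described above can be illustrated with a minimal sketch. The function names and data layout here are hypothetical (not from the paper's codebase): we assume the ANN index returns `(doc_id, approx_score)` pairs, one set per query embedding, and we keep only the top-k documents by their best approximate score before any exact ColBERT-style scoring takes place.

```python
def prune_candidates_by_ann_scores(ann_hits, k=200):
    """Reduce the candidate set using approximate ANN scores.

    ann_hits: iterable of (doc_id, approx_score) pairs, possibly with
        repeated doc_ids (one ANN lookup is made per query embedding,
        so the same document can be retrieved several times).
    k: number of candidates to retain for full, exact scoring
        (the paper finds k=200 preserves effectiveness).

    Returns the doc_ids of the top-k candidates, ranked by their
    maximum approximate score across all query embeddings.
    """
    best = {}
    for doc_id, score in ann_hits:
        # Keep the highest approximate score seen for each document.
        if doc_id not in best or score > best[doc_id]:
            best[doc_id] = score
    # Rank documents by descending approximate score and truncate to k.
    ranked = sorted(best, key=best.get, reverse=True)
    return ranked[:k]
```

Only the documents returned by this function would then be re-ranked with the accurate (and more expensive) multi-embedding scoring, which is where the efficiency gain comes from.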