This paper outlines a conceptual framework for understanding recent developments in information retrieval and natural language processing, one that attempts to integrate dense and sparse retrieval methods. I propose a representational approach that breaks the core text retrieval problem into a logical scoring model and a physical retrieval model. The scoring model is defined in terms of encoders, which map queries and documents into a representational space, and a comparison function that computes query-document scores. The physical retrieval model defines how a system produces the top-k scoring documents from an arbitrarily large corpus with respect to a query. The scoring model can be further analyzed along two dimensions: dense vs. sparse representations and supervised (learned) vs. unsupervised approaches. I show that many recently proposed retrieval methods, including multi-stage ranking designs, can be seen as different parameterizations in this framework, and that a unified view suggests a number of open research questions, providing a roadmap for future work. As a bonus, this conceptual framework establishes connections to sentence similarity tasks in natural language processing and to information access "technologies" prior to the dawn of computing.
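The split between a logical scoring model and a physical retrieval model can be illustrated with a minimal sketch. Everything named here (`encode_sparse`, `dot`, `retrieve_top_k`, the toy corpus) is a hypothetical illustration and not taken from the paper: the encoder is an unsupervised sparse one (raw term frequencies), the comparison function is an inner product, and the physical retrieval step is an assumed brute-force scan rather than an inverted index or nearest-neighbor search.

```python
import heapq
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Assumption: a sparse representation is a mapping from term to weight.
SparseVec = Dict[str, float]


def encode_sparse(text: str) -> SparseVec:
    """Unsupervised sparse encoder: term frequencies over whitespace tokens."""
    return {term: float(count) for term, count in Counter(text.lower().split()).items()}


def dot(q: SparseVec, d: SparseVec) -> float:
    """Comparison function: inner product of two sparse vectors."""
    return sum(weight * d.get(term, 0.0) for term, weight in q.items())


def retrieve_top_k(
    query: str,
    corpus: List[str],
    k: int,
    encode_query: Callable[[str], SparseVec] = encode_sparse,
    encode_doc: Callable[[str], SparseVec] = encode_sparse,
    compare: Callable[[SparseVec, SparseVec], float] = dot,
) -> List[Tuple[int, float]]:
    """Physical retrieval by exhaustive scan: score every document, keep the top k.

    Returns (document index, score) pairs in descending score order.
    """
    q_vec = encode_query(query)
    scored = ((compare(q_vec, encode_doc(doc)), i) for i, doc in enumerate(corpus))
    return [(i, score) for score, i in heapq.nlargest(k, scored)]


corpus = [
    "dense retrieval with learned encoders",
    "sparse retrieval with inverted indexes",
    "cooking pasta at home",
]
print(retrieve_top_k("sparse retrieval", corpus, k=2))
```

Under these assumptions, swapping `encode_query` and `encode_doc` for learned encoders, or for encoders that emit dense vectors with a matching `compare`, changes the logical scoring model without touching `retrieve_top_k`. Replacing the exhaustive scan with an index changes only the physical retrieval model.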