The Differentiable Search Index (DSI) is an emerging paradigm for information retrieval. Unlike traditional retrieval architectures, where indexing and retrieval are two separate components, DSI uses a single transformer model to perform both. In this paper, we identify and tackle an important issue in current DSI models: the data distribution mismatch between the DSI indexing and retrieval processes. Specifically, we argue that, at indexing time, current DSI methods learn to connect the text of long documents to their identifiers, whereas at retrieval time document identifiers must be retrieved from queries that are commonly much shorter than the indexed documents. This problem is further exacerbated when DSI is used for cross-lingual retrieval, where document text and query text are in different languages. To address this fundamental problem, we propose a simple yet effective indexing framework for DSI, called DSI-QG. When indexing, DSI-QG represents each document with a number of potentially relevant queries that are generated by a query generation model and then re-ranked and filtered by a cross-encoder ranker. The presence of these queries at indexing time allows DSI models to connect a document identifier to a set of queries, thereby mitigating the data distribution mismatch between the indexing and retrieval phases. Empirical results on popular mono-lingual and cross-lingual passage retrieval datasets show that DSI-QG significantly outperforms the original DSI model.
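The indexing step described in the abstract (generate candidate queries per document, re-rank them, keep the best, and pair each kept query with the document identifier) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `build_indexing_pairs`, its parameters `n_generated` and `n_kept`, and the two toy functions are hypothetical names introduced here. In practice the `generate` callable would wrap a query generation model and `score` would wrap a cross-encoder ranker.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interfaces (assumptions): in DSI-QG these roles are played by a
# query generation model and a cross-encoder ranker; here they are callables.
QueryGenerator = Callable[[str, int], List[str]]   # (doc_text, n) -> queries
RelevanceScorer = Callable[[str, str], float]      # (query, doc_text) -> score


def build_indexing_pairs(
    docs: Dict[str, str],
    generate: QueryGenerator,
    score: RelevanceScorer,
    n_generated: int = 10,
    n_kept: int = 3,
) -> List[Tuple[str, str]]:
    """Return (query, docid) pairs that represent documents at indexing time."""
    pairs: List[Tuple[str, str]] = []
    for docid, text in docs.items():
        # Deduplicate generated queries while preserving generation order.
        candidates = list(dict.fromkeys(generate(text, n_generated)))
        # Re-rank candidates by relevance to the document, highest first.
        ranked = sorted(candidates, key=lambda q: score(q, text), reverse=True)
        # Filter: keep only the top-ranked queries for this document.
        pairs.extend((query, docid) for query in ranked[:n_kept])
    return pairs


# Toy stand-ins for illustration only; they are not real models.
def toy_generate(text: str, n: int) -> List[str]:
    words = text.lower().split()
    # Sliding three-word windows act as fake "generated queries".
    windows = [" ".join(words[i:i + 3]) for i in range(max(len(words) - 2, 0))]
    return windows[:n]


def toy_score(query: str, text: str) -> float:
    # Count distinct query words longer than three characters found in the text.
    doc_words = set(text.lower().split())
    return float(sum(1 for w in set(query.split()) if len(w) > 3 and w in doc_words))


if __name__ == "__main__":
    corpus = {
        "d1": "the cat sat on the mat",
        "d2": "neural retrieval models index documents",
    }
    for query, docid in build_indexing_pairs(corpus, toy_generate, toy_score, n_kept=2):
        print(f"{query!r} -> {docid}")
```

The resulting pairs would then serve as the training input for the DSI model in place of raw document text, so that the model sees short, query-like inputs during both indexing and retrieval.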