The Differentiable Search Index (DSI) is an emerging paradigm for information retrieval. Unlike traditional retrieval architectures, where indexing and retrieval are separate components, DSI uses a single transformer model to perform both. In this paper, we identify and tackle an important issue with current DSI models: the data distribution mismatch between the DSI indexing and retrieval processes. Specifically, we argue that at indexing time, current DSI methods learn to build connections between long document texts and their identifiers, whereas at retrieval time, short query texts are given to the DSI model to retrieve those document identifiers. This problem is further exacerbated when DSI is used for cross-lingual retrieval, where document text and query text are in different languages. To address this fundamental problem, we propose a simple yet effective indexing framework for DSI called DSI-QG. In DSI-QG, each document is represented at indexing time by a number of relevant queries produced by a query generation model. This allows DSI models to connect a document identifier to a set of query texts when indexing, thereby mitigating the data distribution mismatch between the indexing and retrieval phases. Empirical results on popular mono-lingual and cross-lingual passage retrieval benchmark datasets show that DSI-QG significantly outperforms the original DSI model.
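To make the contrast between the two indexing schemes concrete, the following is a minimal sketch of how the training pairs might be assembled. It is an illustration under stated assumptions, not the paper's implementation: `toy_query_generator` is a hypothetical stand-in for a trained query generation model, and the function names, the corpus layout (a mapping from identifier to text), and the `num_queries` parameter are all chosen here for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: any callable mapping (document, n) to n query strings.
QueryGenerator = Callable[[str, int], List[str]]


def toy_query_generator(document: str, num_queries: int) -> List[str]:
    """Stand-in for a trained query generation model (an assumption).

    It returns short three-word windows from the document so that the
    sketch runs without any model weights.
    """
    words = document.split()
    window = 3
    queries = []
    for i in range(num_queries):
        # Wrap around so that short documents still yield num_queries outputs.
        start = (i * window) % max(len(words), 1)
        queries.append(" ".join(words[start:start + window]))
    return queries


def build_original_dsi_examples(corpus: Dict[str, str]) -> List[Tuple[str, str]]:
    """Original DSI indexing: the input is the (long) document text itself."""
    return [(text, docid) for docid, text in corpus.items()]


def build_dsi_qg_examples(
    corpus: Dict[str, str],
    generate: QueryGenerator,
    num_queries: int,
) -> List[Tuple[str, str]]:
    """DSI-QG style indexing: each document is replaced by generated queries.

    Every generated query becomes one (input, target) pair whose target is
    the document identifier, so indexing inputs resemble retrieval inputs.
    """
    examples = []
    for docid, text in corpus.items():
        for query in generate(text, num_queries):
            examples.append((query, docid))
    return examples


if __name__ == "__main__":
    corpus = {
        "D1": "differentiable search index maps queries to document identifiers",
        "D2": "query generation produces short questions for each passage",
    }
    for source, target in build_dsi_qg_examples(corpus, toy_query_generator, 2):
        print(f"{source!r} -> {target}")
```

In a cross-lingual setting, the same structure would apply if the generator produced queries in the query language rather than the document language; only the `generate` callable would change.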