Current state-of-the-art document retrieval solutions mainly follow an index-retrieve paradigm, in which the index is difficult to optimize directly for the final retrieval target. In this paper, we aim to show that an end-to-end deep neural network that unifies the training and indexing stages can significantly improve recall over traditional methods. To this end, we propose Neural Corpus Indexer (NCI), a sequence-to-sequence network that directly generates relevant document identifiers for a given query. To optimize the recall performance of NCI, we design a prefix-aware weight-adaptive decoder architecture and leverage tailored techniques including query generation, semantic document identifiers, and consistency-based regularization. Empirical studies demonstrate the superiority of NCI on two commonly used academic benchmarks: compared to the best baseline method, it achieves relative improvements of +17.6% in Recall@1 on the NQ320k dataset and +16.8% in R-Precision on the TriviaQA dataset.
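To make the idea of generating document identifiers directly more concrete, the following is a minimal sketch, not the NCI implementation. It assumes that each document identifier is a short sequence of integer tokens and that decoding is restricted to valid identifiers with a prefix tree and beam search; the abstract does not specify these details. The names `build_trie`, `constrained_beam_search`, `toy_score`, and the `END` marker are hypothetical, and `toy_score` merely stands in for the log-probabilities a trained sequence-to-sequence model would produce.

```python
from typing import Callable, Dict, List, Sequence, Tuple

END = -1  # hypothetical marker: a complete identifier ends at this trie node


def build_trie(doc_ids: Sequence[Sequence[int]]) -> Dict[int, dict]:
    """Build a prefix tree over all valid document identifiers."""
    root: Dict[int, dict] = {}
    for doc_id in doc_ids:
        node = root
        for tok in doc_id:
            node = node.setdefault(tok, {})
        node[END] = {}
    return root


def constrained_beam_search(
    trie: Dict[int, dict],
    score_fn: Callable[[Tuple[int, ...], int], float],
    beam_size: int,
) -> List[Tuple[int, ...]]:
    """Return up to beam_size valid identifiers, best first.

    score_fn(prefix, tok) is assumed to return a log-score for emitting
    tok after prefix; only tokens that keep the prefix inside the trie
    are ever considered, so every output is an existing identifier.
    """
    beams = [(0.0, (), trie)]  # (cumulative score, prefix, trie node)
    finished: List[Tuple[float, Tuple[int, ...]]] = []
    while beams:
        candidates = []
        for score, prefix, node in beams:
            for tok, child in node.items():
                if tok == END:
                    finished.append((score, prefix))
                else:
                    candidates.append(
                        (score + score_fn(prefix, tok), prefix + (tok,), child)
                    )
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_size]
    finished.sort(key=lambda f: f[0], reverse=True)
    return [prefix for _, prefix in finished[:beam_size]]


# Toy corpus of three identifiers and a stand-in scorer that favors (1, 2, 4).
corpus_ids = [(1, 2, 3), (1, 2, 4), (2, 0, 1)]
target = (1, 2, 4)


def toy_score(prefix: Tuple[int, ...], tok: int) -> float:
    pos = len(prefix)
    return 0.0 if pos < len(target) and target[pos] == tok else -1.0


results = constrained_beam_search(build_trie(corpus_ids), toy_score, beam_size=2)
print(results)  # [(1, 2, 4), (1, 2, 3)]
```

In this toy setting the top-ranked identifier is the one the scorer prefers, and the runner-up shares its prefix, which is the kind of behavior one would expect if identifiers encode semantic similarity.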