The core task of information retrieval (IR) is to identify relevant information from large-scale resources and return it as a ranked list in response to the user's information need. In recent years, the resurgence of deep learning has greatly advanced this field and given rise to a hot topic known as neural information retrieval (NeuIR), especially the paradigm of pre-training methods (PTMs). Owing to sophisticated pre-training objectives and large model sizes, pre-trained models can learn universal language representations from massive textual data, which are beneficial to the ranking task of IR. Since a large number of works have been dedicated to the application of PTMs in IR, we believe it is the right time to summarize the current status, learn from existing research, and gain insights for future development. In this survey, we present an overview of PTMs applied to different components of an IR system, including the retrieval component, the re-ranking component, and other components. In addition, we introduce PTMs specifically designed for IR, and summarize available datasets as well as benchmark leaderboards. Finally, we discuss some open challenges and envision several promising directions, with the hope of inspiring further research on these topics.