The core task of information retrieval (IR) is to identify relevant information from large-scale resources and return it as a ranked list in response to the user's information need. In recent years, the resurgence of deep learning has greatly advanced this field and led to a hot topic known as neural information retrieval (NeuIR), especially the paradigm of pre-training methods (PTMs). Owing to sophisticated pre-training objectives and huge model sizes, pre-trained models can learn universal language representations from massive textual data, which are beneficial to the ranking task of IR. Recently, a large number of works dedicated to the application of PTMs in IR have been introduced to improve retrieval performance. Considering the rapid progress of this direction, this survey aims to provide a systematic review of pre-training methods in IR. Specifically, we present an overview of PTMs applied in different components of an IR system, including the retrieval component, the re-ranking component, and other components. In addition, we introduce PTMs specifically designed for IR, and summarize available datasets as well as benchmark leaderboards. Finally, we discuss some open challenges and highlight several promising directions, with the hope of inspiring and facilitating more work on these topics in future research.