Recently, InPars introduced a method to efficiently use large language models (LLMs) in information retrieval tasks: via few-shot examples, an LLM is induced to generate relevant queries for documents. These synthetic query-document pairs can then be used to train a retriever. However, InPars and, more recently, Promptagator, rely on proprietary LLMs such as GPT-3 and FLAN to generate such datasets. In this work we introduce InPars-v2, a dataset generator that uses open-source LLMs and existing powerful rerankers to select synthetic query-document pairs for training. A simple BM25 retrieval pipeline followed by a monoT5 reranker finetuned on InPars-v2 data achieves new state-of-the-art results on the BEIR benchmark. To allow researchers to further improve our method, we open source the code, synthetic data, and finetuned models: https://github.com/zetaalphavector/inPars/tree/master/tpu
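As a rough illustration of the few-shot generation step described above, the sketch below prompts an open-source causal LLM with a handful of document/query examples and lets it complete a query for a new document. It is a minimal sketch only: the checkpoint name (GPT-J 6B), the few-shot examples, the sampling settings, and the helper `generate_query` are illustrative assumptions, not the exact prompt or configuration released with InPars-v2.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumption: GPT-J 6B as the open-source generator (a large model; loading it
# requires substantial memory). The few-shot examples are illustrative only.
MODEL_NAME = "EleutherAI/gpt-j-6B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

FEW_SHOT = (
    "Example 1:\n"
    "Document: The Manhattan Project produced the first nuclear weapons during WWII.\n"
    "Relevant Query: who led the manhattan project\n\n"
    "Example 2:\n"
    "Document: Photosynthesis converts light energy into chemical energy in plants.\n"
    "Relevant Query: what does photosynthesis produce\n\n"
)

def generate_query(document: str, max_new_tokens: int = 32) -> str:
    """Append a new document to the few-shot prompt and let the LLM complete the query."""
    prompt = FEW_SHOT + f"Example 3:\nDocument: {document}\nRelevant Query:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Keep only the newly generated text, and only its first line (the query).
    completion = tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return completion.strip().split("\n")[0]
```

Each generated query is paired with the document it was generated from, yielding the synthetic query-document pairs used for training.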
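The sketch below illustrates the monoT5-style scoring used in the pipeline above: the same relevance score can be used to select the highest-quality synthetic pairs for training and to rerank BM25 candidates at retrieval time. This is a sketch under assumptions: the checkpoint name (a public MS MARCO monoT5) and the helpers `monot5_score` and `rerank` are illustrative, not the finetuned rerankers released with InPars-v2.

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Assumption: a publicly available monoT5 checkpoint stands in for the
# InPars-v2 finetuned rerankers released by the authors.
MODEL_NAME = "castorini/monot5-base-msmarco-10k"

tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME).eval()

# monoT5 emits "true" or "false" as its first decoded token.
TRUE_ID = tokenizer.convert_tokens_to_ids("▁true")
FALSE_ID = tokenizer.convert_tokens_to_ids("▁false")

def monot5_score(query: str, document: str) -> float:
    """Return the probability that the document is relevant to the query."""
    prompt = f"Query: {query} Document: {document} Relevant:"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    decoder_input_ids = torch.full((1, 1), model.config.decoder_start_token_id)
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, 0]
    # Softmax over the "false"/"true" logits; return P("true").
    probs = torch.softmax(logits[[FALSE_ID, TRUE_ID]], dim=0)
    return probs[1].item()

def rerank(query: str, candidates: list[tuple[str, str]]) -> list[tuple[str, float]]:
    """Rerank (doc_id, text) candidates, e.g. from a BM25 run, by descending score."""
    scored = [(doc_id, monot5_score(query, text)) for doc_id, text in candidates]
    return sorted(scored, key=lambda x: x[1], reverse=True)
```

Applied to the synthetic data, the same scoring function provides a simple filter: generated query-document pairs with low relevance scores can be discarded before finetuning the retriever or reranker.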