Many information retrieval tasks require large labeled datasets for fine-tuning. However, such datasets are often unavailable, and their utility for real-world applications can diminish quickly due to domain shifts. To address this challenge, we develop and motivate a method for using large language models (LLMs) to generate large numbers of synthetic queries cheaply. The method begins by generating a small number of synthetic queries using an expensive LLM. After that, a much less expensive LLM is used to create large numbers of synthetic queries, which are used to fine-tune a family of reranker models. These rerankers are then distilled into a single efficient retriever for use in the target domain. We show that this technique boosts zero-shot accuracy in long-tail domains, even when only 2K synthetic queries are used for fine-tuning, and that it achieves substantially lower latency than standard reranking methods. We make our end-to-end approach, including our synthetic datasets and replication code, publicly available on GitHub.
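The two-stage generation step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the LLM calls are hypothetical stubs (in practice they would be API or local-model calls), and the function names are invented for this sketch.

```python
# Sketch of the two-stage synthetic query generation pipeline.
# Stage 1: an expensive LLM writes a few high-quality seed queries.
# Stage 2: a cheap LLM, prompted with those seeds, generates many more.

def expensive_llm_generate(passage):
    # Stub standing in for a call to a costly, high-quality LLM.
    return [f"seed query about {passage}"]

def cheap_llm_generate(passage, seed_queries, n=3):
    # Stub standing in for a cheap LLM prompted with the seed queries
    # as few-shot examples.
    return [f"generated query {i} about {passage}" for i in range(n)]

def build_synthetic_dataset(passages, seed_passages=2):
    # Collect seed queries from a small sample of passages (stage 1),
    # then expand over the full corpus with the cheap model (stage 2).
    seeds = []
    for p in passages[:seed_passages]:
        seeds.extend(expensive_llm_generate(p))
    pairs = []
    for p in passages:
        for q in cheap_llm_generate(p, seeds):
            pairs.append((q, p))  # (query, positive passage) training pair
    return pairs

pairs = build_synthetic_dataset(["solar panel maintenance", "tax law"])
print(len(pairs))  # 2 passages x 3 cheap queries each -> 6
```

The resulting (query, passage) pairs would then serve as fine-tuning data for the reranker models mentioned in the abstract.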