The information retrieval community has recently witnessed a revolution driven by large pretrained transformer models. Another key ingredient for this revolution was the MS MARCO dataset, whose scale and diversity have enabled zero-shot transfer learning to various tasks. However, not all IR tasks and domains can benefit equally from a single dataset. Extensive research across NLP tasks has shown that using domain-specific training data, as opposed to general-purpose data, improves the performance of neural models. In this work, we harness the few-shot capabilities of large pretrained language models as synthetic data generators for IR tasks. We show that models finetuned solely on our unsupervised dataset outperform strong baselines such as BM25, as well as recently proposed self-supervised dense retrieval methods. Furthermore, retrievers finetuned on both supervised and our synthetic data achieve better zero-shot transfer than models finetuned only on supervised data. Code, models, and data are available at https://github.com/zetaalphavector/inpars.
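As a minimal illustrative sketch of the idea described above, the snippet below shows how a few-shot prompt to a generic causal language model can turn an unlabeled document into a synthetic (document, query) training pair. The model choice, prompt wording, and example pairs are assumptions introduced here for illustration and are not the paper's exact setup.

```python
# Hypothetical sketch of few-shot synthetic query generation for IR,
# using a generic causal LM via Hugging Face transformers.
# Model name, prompt wording, and example pairs are illustrative
# assumptions, not the exact configuration used in the paper.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # placeholder model

# A few (document, query) pairs serve as in-context examples.
FEW_SHOT_PROMPT = (
    "Document: The Eiffel Tower was completed in 1889 in Paris.\n"
    "Relevant query: when was the eiffel tower built\n\n"
    "Document: Aspirin is commonly used to reduce fever and relieve pain.\n"
    "Relevant query: what is aspirin used for\n\n"
)

def generate_query(document: str, max_new_tokens: int = 32) -> str:
    """Prompt the LM with few-shot examples plus an unlabeled document
    and return the generated (synthetic) query."""
    prompt = FEW_SHOT_PROMPT + f"Document: {document}\nRelevant query:"
    output = generator(
        prompt, max_new_tokens=max_new_tokens, do_sample=False
    )[0]["generated_text"]
    # Keep only the continuation produced after the prompt, first line only.
    return output[len(prompt):].strip().split("\n")[0]

if __name__ == "__main__":
    doc = "MS MARCO is a large-scale dataset for machine reading comprehension."
    # The resulting (doc, query) pair could be added to a synthetic
    # training set for finetuning a retriever.
    print(generate_query(doc))
```

Each generated query, paired with the document that prompted it, yields one synthetic training example; repeating this over a document collection produces the kind of unsupervised finetuning data the abstract refers to.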