Dense retrievers for open-domain question answering (ODQA) have been shown to achieve impressive performance by training on large datasets of question-passage pairs. We investigate whether dense retrievers can be learned in a self-supervised fashion, and applied effectively without any annotations. We observe that existing pretrained models for retrieval struggle in this scenario, and propose a new pretraining scheme designed for retrieval: recurring span retrieval. We use recurring spans across passages in a document to create pseudo examples for contrastive learning. The resulting model -- Spider -- performs surprisingly well without any examples on a wide range of ODQA datasets, and is competitive with BM25, a strong sparse baseline. In addition, Spider often outperforms strong baselines like DPR trained on Natural Questions, when evaluated on questions from other datasets. Our hybrid retriever, which combines Spider with BM25, improves over its components across all datasets, and is often competitive with in-domain DPR models, which are trained on tens of thousands of examples.
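As a rough illustration of the pretraining signal described above, the Python sketch below constructs pseudo (query, positive) pairs from spans that recur across passages of the same document; in-batch passages can then serve as negatives for contrastive learning. The function names and the fixed n-gram length are illustrative assumptions, and the paper's actual span selection and query transformation are more involved.

```python
from collections import defaultdict
from itertools import combinations

def recurring_spans(passages, n=5):
    """Map each n-gram span to the indices of passages it appears in.

    Hypothetical helper: real span selection would filter by span
    length and frequency; here we simply use fixed-length n-grams.
    """
    index = defaultdict(set)
    for i, passage in enumerate(passages):
        tokens = passage.split()
        for j in range(len(tokens) - n + 1):
            index[" ".join(tokens[j:j + n])].add(i)
    # Keep only spans that recur in at least two distinct passages.
    return {span: idxs for span, idxs in index.items() if len(idxs) >= 2}

def pseudo_examples(passages, n=5):
    """Yield (query_passage, positive_passage) pairs for contrastive learning.

    One passage containing the recurring span plays the role of the
    query; another passage containing the same span is its positive.
    """
    for span, idxs in recurring_spans(passages, n).items():
        for qi, pi in combinations(sorted(idxs), 2):
            yield passages[qi], passages[pi]
```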
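The hybrid retriever mentioned in the abstract fuses dense and sparse evidence. A minimal sketch of one common fusion rule, assuming min-max score normalization and a linear interpolation weight, follows; the paper's exact combination rule may differ.

```python
def hybrid_scores(dense_scores, sparse_scores, weight=0.5):
    """Combine dense (Spider) and sparse (BM25) scores per passage ID.

    Assumes non-empty score dicts; passages missing from one ranked
    list receive a normalized score of 0 for that component.
    """
    def normalize(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {pid: (s - lo) / span for pid, s in scores.items()}

    d, s = normalize(dense_scores), normalize(sparse_scores)
    return {pid: weight * d.get(pid, 0.0) + (1 - weight) * s.get(pid, 0.0)
            for pid in d.keys() | s.keys()}
```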