Dense retrievers have achieved state-of-the-art results on text retrieval and open-domain question answering (ODQA). Yet most of these achievements rely on large annotated datasets, and unsupervised learning for dense retrieval models remains an open problem. In this work, we explore two categories of methods for creating pseudo query-document pairs, namely query extraction (QExt) and transferred query generation (TQGen), to augment retriever training in an annotation-free and scalable manner. Specifically, QExt extracts pseudo queries from document structures or by selecting salient random spans, while TQGen utilizes generation models trained for other NLP tasks (e.g., summarization) to produce pseudo queries. Extensive experiments show that dense retrievers trained with individual augmentation methods perform comparably to multiple strong baselines, and that combining them leads to further improvements, achieving state-of-the-art unsupervised dense retrieval performance on both BEIR and ODQA datasets.
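To make the two augmentation ideas concrete, the sketch below illustrates how pseudo query-document pairs might be produced from unlabeled text: a QExt-style pseudo query taken as a random span of the document (salience scoring omitted), and a TQGen-style pseudo query produced by an off-the-shelf summarization model. The checkpoint `facebook/bart-large-cnn`, the span-length limits, and the decoding settings are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of annotation-free pseudo query creation, assuming the
# Hugging Face transformers library. Hyperparameters and the summarization
# checkpoint are placeholders, not the paper's settings.
import random
from transformers import pipeline

# TQGen-style: reuse a model trained for another task (summarization) as a query generator.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")


def extract_random_span(document: str, min_words: int = 5, max_words: int = 12) -> str:
    """QExt-style pseudo query: a random contiguous span of the document
    (the plain variant; salient-span selection is omitted here)."""
    words = document.split()
    span_len = random.randint(min_words, min(max_words, len(words)))
    start = random.randint(0, len(words) - span_len)
    return " ".join(words[start:start + span_len])


def generate_pseudo_query(document: str, max_query_tokens: int = 32) -> str:
    """TQGen-style pseudo query: a short summary of the document."""
    return summarizer(
        document,
        max_length=max_query_tokens,
        min_length=5,
        do_sample=False,
    )[0]["summary_text"]


if __name__ == "__main__":
    doc = (
        "Dense retrievers map queries and documents into a shared vector space "
        "and rank documents by similarity, but training them typically requires "
        "large sets of annotated query-document pairs."
    )
    # Each (pseudo_query, doc) pair can then serve as a positive training example.
    print("QExt pseudo query:", extract_random_span(doc))
    print("TQGen pseudo query:", generate_pseudo_query(doc))
```

Each resulting (pseudo query, document) pair plays the role of an annotated positive pair when training the dense retriever, which is what makes the approach annotation-free and scalable.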