Dense retrievers have made significant strides in text retrieval and open-domain question answering, yet most of these gains rely on large amounts of human supervision. In this work, we propose two methods that create pseudo query-document pairs and train dense retrieval models in an annotation-free and scalable manner: query extraction and transferred query generation. The former produces pseudo queries by selecting salient spans from the original document. The latter utilizes generation models trained for other NLP tasks (e.g., summarization) to produce pseudo queries. Extensive experiments show that models trained with these augmentation methods perform comparably to, or better than, multiple strong baselines. Combining the two strategies yields further improvements, achieving state-of-the-art unsupervised dense retrieval performance on both BEIR and ODQA datasets.
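The query-extraction idea — turning salient spans of a document into pseudo queries — can be illustrated with a minimal term-frequency heuristic. This is only a sketch of the general approach; the function name, span scoring, and parameters below are illustrative assumptions, not the paper's actual saliency method:

```python
import re
from collections import Counter

def extract_pseudo_queries(document, num_queries=2, span_len=8):
    """Return pseudo queries: fixed-length word spans of the document
    scored by how many frequent document terms they contain.
    (Illustrative heuristic, not the paper's saliency model.)"""
    words = re.findall(r"\w+", document.lower())
    freq = Counter(words)
    # Candidate spans: sliding windows of span_len words over the document.
    spans = [words[i:i + span_len]
             for i in range(max(1, len(words) - span_len + 1))]
    # Score each span by the summed frequency of its distinct terms.
    scored = sorted(spans,
                    key=lambda s: sum(freq[w] for w in set(s)),
                    reverse=True)
    return [" ".join(s) for s in scored[:num_queries]]
```

Each extracted span can then be paired with its source document as a positive (query, document) training example for a dense retriever.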