Recently, methods have been developed to improve the performance of dense passage retrieval by using context-supervised pre-training. These methods simply treat two passages from the same document as relevant, without taking into account the possibility of weakly correlated pairs. To alleviate this issue, this paper proposes query-as-context pre-training, a simple yet effective pre-training technique. Query-as-context pre-training assumes that a query derived from a passage is more likely to be relevant to that passage, and forms a passage-query pair accordingly. These passage-query pairs are then used in contrastive or generative context-supervised pre-training. The pre-trained models are evaluated on large-scale passage retrieval benchmarks and out-of-domain zero-shot benchmarks. Experimental results show that query-as-context pre-training brings considerable gains while also speeding up training, demonstrating its effectiveness and efficiency. Our code will be available at https://github.com/caskcsg/ir/tree/main/cotmae-qc.
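To make the pairing step concrete, the sketch below (not the authors' implementation) derives a pseudo-query from each passage with an off-the-shelf doc2query-style T5 generator and then applies an in-batch-negative contrastive (InfoNCE) loss over the resulting passage-query pairs. The `castorini/doc2query-t5-base-msmarco` checkpoint, the plain BERT backbone with CLS pooling, and the temperature value are all illustrative assumptions standing in for the paper's actual components.

```python
# Minimal sketch of query-as-context pairing + contrastive pre-training.
# Assumptions (not from the paper): doc2query-T5 as the query generator,
# bert-base-uncased with CLS pooling as the encoder, temperature 0.05.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer, T5ForConditionalGeneration

qg_tok = AutoTokenizer.from_pretrained("castorini/doc2query-t5-base-msmarco")
qg_model = T5ForConditionalGeneration.from_pretrained(
    "castorini/doc2query-t5-base-msmarco")

enc_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")


def generate_query(passage: str) -> str:
    """Derive a pseudo-query from a passage (the query-as-context pair)."""
    ids = qg_tok(passage, return_tensors="pt", truncation=True).input_ids
    out = qg_model.generate(ids, max_length=32, do_sample=True, top_k=10)
    return qg_tok.decode(out[0], skip_special_tokens=True)


def cls_embed(texts):
    """Encode texts and take the [CLS] vector as the representation."""
    batch = enc_tok(texts, padding=True, truncation=True, return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0]


def contrastive_loss(passages, queries, temperature=0.05):
    """In-batch-negative InfoNCE over passage-query pairs."""
    p = F.normalize(cls_embed(passages), dim=-1)
    q = F.normalize(cls_embed(queries), dim=-1)
    logits = p @ q.T / temperature        # (B, B) similarity matrix
    labels = torch.arange(len(passages))  # diagonal entries are positives
    return F.cross_entropy(logits, labels)


passages = ["Dense retrieval encodes passages into vectors ...",
            "Contrastive pre-training pulls related texts together ..."]
queries = [generate_query(p) for p in passages]
print(contrastive_loss(passages, queries))
```

The generative variant described in the abstract would instead feed the passage-query pair to a decoder-side reconstruction objective; the contrastive form is shown here only because it is the simpler of the two to sketch.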