Although neural information retrieval has seen great improvements, recent work has shown that the generalization ability of dense retrieval models to target domains with different distributions is limited, in contrast to the results obtained with interaction-based models. To address this issue, researchers have resorted to adversarial learning and query generation approaches; both approaches, however, have yielded only limited improvements. In this paper, we propose a self-supervision approach in which pseudo-relevance labels are automatically generated on the target domain. To do so, we first use the standard BM25 model on the target domain to obtain an initial ranking of documents, and then use the interaction-based model T53B to re-rank the top documents. We further combine this approach with knowledge distillation relying on an interaction-based teacher model trained on the source domain. Our experiments reveal that pseudo-relevance labeling using T53B and the MiniLM teacher performs on average better than other approaches and helps improve the state-of-the-art query generation approach GPL when the latter is fine-tuned on the pseudo-relevance labeled data.
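To make the labeling pipeline concrete, the following is a minimal sketch in Python of the two-stage pseudo-relevance labeling described above. It is illustrative only: it assumes the rank_bm25 and sentence-transformers libraries, substitutes a small public MiniLM cross-encoder for the T53B re-ranker to keep the sketch lightweight, and uses toy corpus, queries, and top_k values that are placeholders rather than the paper's actual setup.

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

# Toy target-domain collection; in practice this is the unlabeled target corpus.
corpus = [
    "dense retrieval models map queries and documents to embeddings",
    "bm25 is a classical lexical ranking function",
    "knowledge distillation transfers scores from a teacher to a student",
    "query generation creates synthetic training data for retrieval",
]
queries = ["lexical ranking with bm25"]

# Stage 1: standard BM25 ranking on the target domain.
bm25 = BM25Okapi([doc.split() for doc in corpus])

# Stage 2: interaction-based re-ranker. The paper uses T53B; a public MiniLM
# cross-encoder stands in here purely for illustration.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

top_k = 3  # size of the BM25 pool to re-rank (illustrative value)
pseudo_labels = []
for query in queries:
    scores = bm25.get_scores(query.split())
    pool = sorted(range(len(corpus)), key=scores.__getitem__, reverse=True)[:top_k]
    rerank_scores = reranker.predict([(query, corpus[i]) for i in pool])
    order = sorted(range(len(pool)), key=rerank_scores.__getitem__, reverse=True)
    # The top re-ranked document becomes a pseudo-positive; the rest of the
    # pool can serve as hard negatives for training the dense retrieval model.
    pseudo_labels.append({
        "query": query,
        "positive": corpus[pool[order[0]]],
        "negatives": [corpus[pool[i]] for i in order[1:]],
    })

print(pseudo_labels[0]["positive"])
```

The resulting pseudo-positives and hard negatives can then be used to fine-tune the dense retrieval model on the target domain, or combined with scores from an interaction-based teacher for knowledge distillation as described above.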