Dense retrieval approaches can overcome the lexical gap and lead to significantly improved search results. However, they require large amounts of training data, which is not available for most domains. As shown in previous work (Thakur et al., 2021b), the performance of dense retrievers severely degrades under a domain shift. This limits the usage of dense retrieval approaches to only a few domains with large training datasets. In this paper, we propose the novel unsupervised domain adaptation method Generative Pseudo Labeling (GPL), which combines a query generator with pseudo labeling from a cross-encoder. On six representative domain-specialized datasets, we find that the proposed GPL can outperform an out-of-the-box state-of-the-art dense retrieval approach by up to 8.9 points nDCG@10. GPL requires less (unlabeled) data from the target domain and is more robust in its training than previous methods. We further investigate the role of six recent pre-training methods in the scenario of domain adaptation for retrieval tasks, where only three could yield improved results. The best approach, TSDAE (Wang et al., 2021), can be combined with GPL, yielding another average improvement of 1.0 points nDCG@10 across the six tasks.
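To make the described pipeline concrete, the following is a minimal sketch of the GPL data-generation and training loop using the transformers and sentence-transformers libraries. It is not the authors' exact configuration: the checkpoint names, the two-passage toy corpus, and hyperparameters such as top_p and batch_size are illustrative assumptions.

```python
# Minimal GPL sketch: (1) generate queries for target-domain passages,
# (2) mine hard negatives with the initial dense retriever,
# (3) pseudo-label score margins with a cross-encoder,
# (4) train the dense retriever with MarginMSE loss.
from transformers import T5Tokenizer, T5ForConditionalGeneration
from sentence_transformers import SentenceTransformer, CrossEncoder, InputExample, losses, util
from torch.utils.data import DataLoader

# Unlabeled passages from the target domain (placeholder examples).
corpus = [
    "Aspirin irreversibly inhibits the COX-1 enzyme in platelets.",
    "The FiQA dataset contains opinion-based financial questions.",
]

# 1) Query generation with a T5 model trained on MS MARCO (assumed checkpoint).
qgen_name = "BeIR/query-gen-msmarco-t5-base-v1"
qgen_tok = T5Tokenizer.from_pretrained(qgen_name)
qgen = T5ForConditionalGeneration.from_pretrained(qgen_name)

def generate_queries(passage, n=3):
    inputs = qgen_tok(passage, return_tensors="pt", truncation=True, max_length=350)
    outputs = qgen.generate(**inputs, max_length=64, do_sample=True, top_p=0.95,
                            num_return_sequences=n)
    return [qgen_tok.decode(o, skip_special_tokens=True) for o in outputs]

# 2) Negative mining with an existing dense retriever (assumed checkpoint).
retriever = SentenceTransformer("sentence-transformers/msmarco-distilbert-base-v3")
corpus_emb = retriever.encode(corpus, convert_to_tensor=True)

# 3) Pseudo labeling: a cross-encoder (assumed checkpoint) scores (query, passage)
#    pairs; the training label is the margin between positive and negative scores.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

train_examples = []
for pos_idx, passage in enumerate(corpus):
    for query in generate_queries(passage):
        q_emb = retriever.encode(query, convert_to_tensor=True)
        hits = util.semantic_search(q_emb, corpus_emb, top_k=2)[0]
        # Use the highest-ranked passage that is not the positive as a hard negative.
        neg_idx = next(h["corpus_id"] for h in hits if h["corpus_id"] != pos_idx)
        scores = cross_encoder.predict([(query, passage), (query, corpus[neg_idx])])
        margin = float(scores[0] - scores[1])
        train_examples.append(InputExample(texts=[query, passage, corpus[neg_idx]],
                                           label=margin))

# 4) Train the dense retriever on the pseudo-labeled triplets with MarginMSE.
loader = DataLoader(train_examples, shuffle=True, batch_size=32)
loss = losses.MarginMSELoss(model=retriever)
retriever.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
```

Because the labels are continuous score margins rather than binary relevance judgments, the MarginMSE objective lets the dense retriever absorb the cross-encoder's finer-grained relevance estimates, which is what makes the pseudo labeling step robust to noisy generated queries.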