In monolingual dense retrieval, many works focus on distilling knowledge from a cross-encoder re-ranker into a dual-encoder retriever, and these methods achieve strong performance thanks to the effectiveness of the cross-encoder re-ranker. However, we find that the performance of the cross-encoder re-ranker is heavily influenced by the number of training samples and the quality of negative samples, both of which are hard to obtain in the cross-lingual setting. In this paper, we propose to use a query generator as the teacher in the cross-lingual setting, which is less dependent on abundant training samples and high-quality negative samples. In addition to traditional knowledge distillation, we further propose a novel enhancement method that uses the query generator to help the dual-encoder align queries from different languages, without requiring any additional parallel sentences. Experimental results show that our method outperforms state-of-the-art methods on two benchmark datasets.
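A minimal sketch of how a query generator can act as a distillation teacher, assuming a Hugging Face-style seq2seq generator and a dual-encoder student whose scores are dot products; the names `query_generator`, `teacher_passage_scores`, and `distill_loss` are illustrative, not the paper's actual implementation:

```python
import torch
import torch.nn.functional as F

def teacher_passage_scores(query_generator, tokenizer, query, passages, device="cpu"):
    """Score each passage by the log-likelihood of generating the query from it.

    A passage that makes the query more likely to be generated receives a
    higher relevance score, so the generator can rank candidates without
    being trained as a discriminative re-ranker.
    """
    scores = []
    for passage in passages:
        inputs = tokenizer(passage, return_tensors="pt", truncation=True).to(device)
        labels = tokenizer(query, return_tensors="pt", truncation=True).input_ids.to(device)
        with torch.no_grad():
            out = query_generator(**inputs, labels=labels)
        # `out.loss` is the mean per-token negative log-likelihood of the
        # query given the passage; negate it so higher means more relevant.
        scores.append(-out.loss)
    return torch.stack(scores)

def distill_loss(student_scores, teacher_scores, temperature=1.0):
    """KL divergence between teacher and student distributions over passages.

    `student_scores` would come from the dual-encoder, e.g. dot products
    between the query embedding and each candidate passage embedding.
    """
    t = F.softmax(teacher_scores / temperature, dim=-1)
    s = F.log_softmax(student_scores / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean")
```

Under this formulation the teacher only needs to score (query, passage) pairs with its generation likelihood, which is consistent with the claim that it is less dependent on abundant training samples and mined hard negatives than a discriminatively trained cross-encoder re-ranker.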