Recent multilingual pre-trained models have shown strong performance on various multilingual tasks. However, these models perform poorly on multilingual retrieval tasks due to the lack of multilingual retrieval training data. In this paper, we propose to mine and generate self-supervised training data from a large-scale unlabeled corpus. We carefully design a mining method that combines sparse and dense models to mine the relevance between unlabeled queries and passages, and we introduce a query generator that produces additional queries in target languages for unlabeled passages. Through extensive experiments on the Mr. TYDI dataset and an industrial dataset from a commercial search engine, we demonstrate that our method outperforms baselines built on various pre-trained multilingual models, and it even achieves on-par performance with the supervised method on the latter dataset.
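The following is a minimal sketch of the two self-supervised steps described above, not the paper's actual implementation. It assumes `rank_bm25`, `sentence-transformers`, and `transformers` as stand-in libraries; the dense encoder name, the query-generator checkpoint path, the score-fusion weights, and the selection threshold are all illustrative placeholders rather than the paper's choices.

```python
# Sketch of (1) mining query-passage pairs with a sparse + dense combination and
# (2) generating synthetic queries for unlabeled passages. All checkpoints and
# hyperparameters below are placeholders, not the paper's settings.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

passages = ["...unlabeled passage 1...", "...unlabeled passage 2..."]
queries = ["...unlabeled query..."]

# --- Step 1: mine (query, passage) pairs by combining sparse and dense scores ---
bm25 = BM25Okapi([p.split() for p in passages])                       # sparse model
dense = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # stand-in dense encoder
p_emb = dense.encode(passages, normalize_embeddings=True)

mined_pairs = []
for q in queries:
    sparse_scores = bm25.get_scores(q.split())
    q_emb = dense.encode([q], normalize_embeddings=True)
    dense_scores = (p_emb @ q_emb.T).squeeze(-1)                      # cosine similarity
    # Simple score fusion of the two signals; the paper's mining strategy may differ.
    fused = 0.5 * (sparse_scores / (sparse_scores.max() + 1e-9)) + 0.5 * dense_scores
    best = int(np.argmax(fused))
    if fused[best] > 0.6:                                             # illustrative threshold
        mined_pairs.append((q, passages[best]))

# --- Step 2: generate queries in the target language for unlabeled passages ---
QG_CKPT = "path/to/multilingual-query-generator"                      # hypothetical checkpoint
tok = AutoTokenizer.from_pretrained(QG_CKPT)
qg = AutoModelForSeq2SeqLM.from_pretrained(QG_CKPT)
for p in passages:
    inputs = tok(p, return_tensors="pt", truncation=True, max_length=512)
    out = qg.generate(**inputs, max_new_tokens=32, num_return_sequences=3, do_sample=True)
    synthetic_queries = tok.batch_decode(out, skip_special_tokens=True)
    mined_pairs.extend((sq, p) for sq in synthetic_queries)
```

The mined and generated pairs would then serve as relevance supervision for fine-tuning a multilingual retriever; the fusion weights and threshold above are only meant to convey the idea of filtering with both sparse and dense evidence.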