Multilingual information retrieval is challenging due to the lack of training datasets for many low-resource languages. We present an effective method that leverages parallel and non-parallel corpora to improve the cross-lingual transfer ability of pretrained multilingual language models for information retrieval. We design a semantic contrastive loss, following regular contrastive learning, to improve the cross-lingual alignment of parallel sentence pairs, and we propose a new contrastive loss, the language contrastive loss, that leverages both parallel and non-parallel corpora to further improve multilingual representation learning. We train our model on an English information retrieval dataset and test its zero-shot transfer ability to other languages. Our experimental results show that our method significantly improves retrieval performance over prior work while requiring far less computational effort. Our model performs well even with a small amount of parallel data, and it can be used as an add-on module with any backbone and for other tasks. Our code is available at: https://github.com/xiyanghu/multilingualIR.
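As a point of reference, below is a minimal sketch of the "regular" contrastive (InfoNCE) formulation over parallel sentence pairs that the semantic contrastive loss is described as following. The function name, variable names, and temperature value are illustrative assumptions rather than the paper's exact implementation, and the proposed language contrastive loss is not specified in the abstract, so it is omitted here.

```python
# Minimal sketch (PyTorch): InfoNCE-style contrastive alignment of parallel
# sentence pairs with in-batch negatives. Names and hyperparameters are
# assumptions for illustration, not the paper's exact implementation.
import torch
import torch.nn.functional as F


def semantic_contrastive_loss(src_emb: torch.Tensor,
                              tgt_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """src_emb, tgt_emb: (batch, dim) encoder embeddings for the two sides
    of each parallel pair. The i-th source sentence is pulled toward its own
    translation and pushed away from the other targets in the batch."""
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    logits = src @ tgt.t() / temperature          # (batch, batch) similarities
    labels = torch.arange(src.size(0), device=src.device)
    # Symmetric loss: align source -> target and target -> source.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```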