Recently, multilingual pre-trained language models (PLMs) such as mBERT and XLM-R have made impressive strides in cross-lingual dense retrieval. Despite these successes, they are general-purpose PLMs, and multilingual PLMs tailored for cross-lingual retrieval remain unexplored. Motivated by the observation that sentences in parallel documents appear in approximately the same order, a regularity that holds across languages, we propose to model this sequential sentence relation to facilitate cross-lingual representation learning. Specifically, we propose a multilingual PLM called the masked sentence model (MSM), which consists of a sentence encoder that generates sentence representations and a document encoder applied to the sequence of sentence vectors from a document. The document encoder is shared across all languages to model the universal sequential sentence relation. To train the model, we introduce a masked sentence prediction task, which masks a sentence vector and predicts it via a hierarchical contrastive loss with sampled negatives. Comprehensive experiments on four cross-lingual retrieval tasks show that MSM significantly outperforms existing advanced pre-training models, demonstrating the effectiveness and stronger cross-lingual retrieval capabilities of our approach. Code and models will be made available.
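To make the described architecture and training objective more concrete, below is a minimal PyTorch sketch of the masked sentence prediction idea: a placeholder sentence encoder, a Transformer document encoder over the sequence of sentence vectors (shared across languages), and a simplified in-batch contrastive loss. All module names, dimensions, and the simplified (non-hierarchical) loss are illustrative assumptions, not the paper's released implementation.

```python
# Sketch of masked sentence prediction with a contrastive objective.
# Assumptions: toy sentence features instead of a real multilingual PLM,
# and in-batch negatives instead of the paper's hierarchical sampling.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedSentenceModel(nn.Module):
    def __init__(self, hidden=256, n_heads=4, n_layers=2):
        super().__init__()
        # Placeholder sentence encoder; in the paper this role is played by a
        # multilingual PLM (e.g., XLM-R) producing one vector per sentence.
        self.sentence_encoder = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU())
        # Document encoder over sentence vectors, shared for all languages.
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=n_heads, batch_first=True)
        self.document_encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Learnable embedding that replaces the masked sentence's vector.
        self.mask_embedding = nn.Parameter(torch.randn(hidden))

    def forward(self, sent_inputs, mask_pos):
        # sent_inputs: (batch, n_sents, hidden) toy per-sentence features.
        sent_vecs = self.sentence_encoder(sent_inputs)                      # (B, S, H)
        batch_idx = torch.arange(sent_vecs.size(0))
        targets = sent_vecs[batch_idx, mask_pos]                            # (B, H)
        masked = sent_vecs.clone()
        masked[batch_idx, mask_pos] = self.mask_embedding                   # mask one sentence
        doc_out = self.document_encoder(masked)                             # (B, S, H)
        preds = doc_out[batch_idx, mask_pos]                                # predicted vectors
        return preds, targets


def contrastive_loss(preds, targets, temperature=0.05):
    # Each prediction should match its own masked sentence vector and not
    # the masked sentences of other documents in the batch (sampled negatives).
    preds = F.normalize(preds, dim=-1)
    targets = F.normalize(targets, dim=-1)
    logits = preds @ targets.t() / temperature                              # (B, B)
    labels = torch.arange(preds.size(0))
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    model = MaskedSentenceModel()
    sents = torch.randn(8, 10, 256)            # 8 documents, 10 sentences each
    mask_pos = torch.randint(0, 10, (8,))      # one masked sentence per document
    preds, targets = model(sents, mask_pos)
    loss = contrastive_loss(preds, targets)
    loss.backward()
    print(loss.item())
```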