Modeling Sequential Sentence Relation to Improve Cross-lingual Dense Retrieval

Recently, multilingual pre-trained language models (PLMs) such as mBERT and XLM-R have made impressive strides in cross-lingual dense retrieval. Despite their success, they are general-purpose PLMs, while multilingual PLMs tailored for cross-lingual retrieval remain unexplored. Motivated by the observation that sentences in parallel documents appear in approximately the same order, a property that is universal across languages, we propose to model this sequential sentence relation to facilitate cross-lingual representation learning. Specifically, we propose a multilingual PLM called the masked sentence model (MSM), which consists of a sentence encoder that generates sentence representations and a document encoder applied to the sequence of sentence vectors from a document. The document encoder is shared across all languages to model the universal sequential sentence relation. To train the model, we propose a masked sentence prediction task, which masks a sentence vector and predicts it via a hierarchical contrastive loss with sampled negatives. Comprehensive experiments on four cross-lingual retrieval tasks show that MSM significantly outperforms existing advanced pre-training models, demonstrating the effectiveness and stronger cross-lingual retrieval capabilities of our approach. Code and model will be available.
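The masked sentence prediction task described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names are hypothetical, the document encoder is replaced by a simple neighbour-averaging stand-in (the paper uses a shared neural document encoder), and the loss is a plain InfoNCE-style contrastive loss over one pool of sampled negatives rather than the hierarchical loss the abstract mentions.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Scale to unit length; fall back to 1.0 to avoid dividing by zero.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]


def mask_sentence(sentence_vectors, mask_index, mask_vector):
    # Return a copy of the document with one sentence vector replaced.
    masked = list(sentence_vectors)
    masked[mask_index] = mask_vector
    return masked


def toy_document_encoder(vectors, index):
    # ASSUMPTION: stand-in for the real document encoder. It predicts the
    # masked position as the mean of its neighbouring sentence vectors,
    # so the document needs at least two sentences.
    neighbours = [vectors[i] for i in (index - 1, index + 1)
                  if 0 <= i < len(vectors)]
    dim = len(vectors[0])
    return [sum(v[d] for v in neighbours) / len(neighbours)
            for d in range(dim)]


def contrastive_loss(predicted, positive, negatives, temperature=0.1):
    # InfoNCE-style loss: negative log-softmax score of the true sentence
    # vector among the true vector plus the sampled negatives.
    p = normalize(predicted)
    candidates = [normalize(positive)] + [normalize(n) for n in negatives]
    logits = [dot(p, c) / temperature for c in candidates]
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]


# Toy data: random vectors stand in for sentence-encoder outputs.
rng = random.Random(0)
dim = 4
document = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
negatives = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(8)]

mask_index = 2
masked_doc = mask_sentence(document, mask_index, [0.0] * dim)
prediction = toy_document_encoder(masked_doc, mask_index)
loss = contrastive_loss(prediction, document[mask_index], negatives)
print(f"masked sentence prediction loss: {loss:.4f}")
```

In a real system the sentence vectors would come from the sentence encoder, the prediction from the shared document encoder, and both would be trained by minimizing this kind of loss over many documents in many languages.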
