Unsupervised Context Aware Sentence Representation Pretraining for Multi-lingual Dense Retrieval

Recent research demonstrates the effectiveness of using pretrained language models (PLMs) to improve dense retrieval and multilingual dense retrieval. In this work, we present a simple but effective monolingual pretraining task called contrastive context prediction (CCP), which learns sentence representations by modeling sentence-level contextual relations. By pulling the embeddings of sentences in a local context closer together and pushing random negative samples away, the embedding spaces of different languages can form isomorphic structures, so that sentence pairs in two different languages are aligned automatically. Our experiments show that model collapse and information leakage occur easily during contrastive training of language models, and that a language-specific memory bank and an asymmetric batch normalization operation play essential roles in preventing collapse and information leakage, respectively. In addition, post-processing the sentence embeddings is very effective for achieving better retrieval performance. On the multilingual sentence retrieval task Tatoeba, our model achieves new SOTA results among methods that do not use bilingual data. Our model also shows larger gains on Tatoeba when transferring between non-English pairs. On two multilingual query-passage retrieval tasks, XOR Retrieve and Mr.TYDI, our model even achieves SOTA results on both, in the zero-shot and supervised settings, among all pretrained models that use bilingual data.
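The abstract does not spell out the training objective, so the following is only a minimal sketch of the general idea, assuming an InfoNCE-style loss over cosine similarities and a fixed-size FIFO queue per language as the memory bank. The names `LanguageMemoryBank` and `context_contrastive_loss`, the `temperature` value, and the queue size are hypothetical and are not taken from the paper; asymmetric batch normalization and the embedding post-processing are not shown.

```python
import math
from collections import deque


def normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


class LanguageMemoryBank:
    """Hypothetical per-language FIFO store of past sentence embeddings.

    Negatives are drawn only from the same language as the anchor, which is
    one plausible reading of a "language-specific memory bank".
    """

    def __init__(self, size=4096):
        self.size = size
        self.banks = {}  # language code -> deque of embeddings

    def add(self, lang, embedding):
        self.banks.setdefault(lang, deque(maxlen=self.size)).append(embedding)

    def negatives(self, lang):
        return list(self.banks.get(lang, ()))


def context_contrastive_loss(anchor, context, negatives, temperature=0.05):
    """Assumed InfoNCE-style loss for one (anchor, context) sentence pair.

    The context sentence (from the anchor's local window) is the positive;
    embeddings from the memory bank act as random negatives.
    """
    a = normalize(anchor)
    logits = [dot(a, normalize(context)) / temperature]
    logits += [dot(a, normalize(n)) / temperature for n in negatives]
    # Numerically stable log-sum-exp over the positive and all negatives.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]  # equals -log softmax(positive)
```

In this sketch the loss is small when the anchor is closer to its context sentence than to any stored negative, and grows when the positive is less similar than the negatives. Keeping one queue per language means an anchor is never contrasted against sentences from another language.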

