Recent research demonstrates the effectiveness of using pretrained language models (PLMs) to improve dense retrieval and multilingual dense retrieval. In this work, we present a simple but effective monolingual pretraining task, contrastive context prediction~(CCP), which learns sentence representations by modeling sentence-level contextual relations. By pulling the embeddings of sentences within a local context closer together and pushing random negative samples away, different languages form isomorphic structures in the embedding space, so that sentence pairs in two different languages become automatically aligned. Our experiments show that model collapse and information leakage occur easily during contrastive training of language models, but a language-specific memory bank and an asymmetric batch normalization operation play essential roles in preventing collapse and information leakage, respectively. In addition, post-processing of the sentence embeddings further improves retrieval performance. On the multilingual sentence retrieval task Tatoeba, our model achieves new SOTA results among methods that do not use bilingual data. Our model also shows larger gains on Tatoeba when transferring between non-English language pairs. On two multilingual query-passage retrieval tasks, XOR Retrieve and Mr.TYDI, our model even achieves SOTA results in both the zero-shot and supervised settings when compared with all pretraining models that use bilingual data.
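As an illustration of the contrastive objective sketched above, consider a minimal InfoNCE-style formulation with negatives drawn from a language-specific memory bank; the notation $h_i$, $h_j$, $\mathcal{M}_{\ell}$, and $\tau$ is introduced here purely for illustration, and the exact loss used by CCP may differ:
\begin{equation*}
\mathcal{L}_{\mathrm{CCP}}(i) \;=\; -\log
\frac{\exp\big(\mathrm{sim}(h_i, h_j)/\tau\big)}
{\exp\big(\mathrm{sim}(h_i, h_j)/\tau\big) \;+\; \sum_{h^{-} \in \mathcal{M}_{\ell}} \exp\big(\mathrm{sim}(h_i, h^{-})/\tau\big)}
\end{equation*}
where $h_i$ is the embedding of a sentence, $h_j$ the embedding of a sentence from its local context, $\mathcal{M}_{\ell}$ a memory bank of negative embeddings maintained separately for language $\ell$, $\mathrm{sim}(\cdot,\cdot)$ a similarity function (e.g., cosine similarity), and $\tau$ a temperature hyperparameter. Minimizing this loss pulls contextually related sentences together while pushing memory-bank negatives of the same language away.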