This paper presents a novel training method, Conditional Masked Language Modeling (CMLM), to effectively learn sentence representations on large-scale unlabeled corpora. CMLM integrates sentence representation learning into MLM training by conditioning on the encoded vectors of adjacent sentences. Our English CMLM model achieves state-of-the-art performance on SentEval, even outperforming models learned with supervised signals. As a fully unsupervised learning method, CMLM can be conveniently extended to a broad range of languages and domains. We find that a multilingual CMLM model co-trained with bitext retrieval (BR) and natural language inference (NLI) tasks outperforms the previous state-of-the-art multilingual models by a large margin, e.g., a 10% improvement over baseline models on cross-lingual semantic search. We also explore the same-language bias of the learned representations and propose a simple, post-training, model-agnostic approach to remove language-identifying information from the representations while retaining sentence semantics.
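To make the conditioning mechanism concrete, here is a minimal sketch of the CMLM idea in Python. All module names, shapes, and the GRU stand-ins for the transformer encoders are hypothetical toy choices, not the paper's actual architecture; the point is only that the sentence encoder receives gradients through the masked-LM loss of the adjacent sentence.

```python
# Minimal sketch of the CMLM idea (toy shapes and modules, assumed for illustration):
# a vector encoding one sentence conditions the masked-LM prediction for its
# adjacent sentence, so the sentence encoder is trained through the MLM loss.
import torch
import torch.nn as nn

vocab_size, hidden = 1000, 64

class TinyCMLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.sent_encoder = nn.GRU(hidden, hidden, batch_first=True)  # stand-in for the sentence encoder
        self.mlm_encoder = nn.GRU(hidden, hidden, batch_first=True)   # stand-in for the MLM network
        self.project = nn.Linear(hidden, hidden)                      # projects the sentence vector
        self.lm_head = nn.Linear(hidden, vocab_size)

    def forward(self, adjacent_ids, masked_ids):
        # Encode the adjacent sentence into a single vector (mean pooling for simplicity).
        adj_states, _ = self.sent_encoder(self.embed(adjacent_ids))
        sent_vec = self.project(adj_states.mean(dim=1))               # (batch, hidden)
        # Condition the MLM on that vector by adding it to every token embedding
        # of the masked sentence (one simple way to inject the conditioning signal).
        tok = self.embed(masked_ids) + sent_vec.unsqueeze(1)
        states, _ = self.mlm_encoder(tok)
        return self.lm_head(states)                                   # logits over the vocabulary

model = TinyCMLM()
adjacent = torch.randint(0, vocab_size, (2, 12))  # toy adjacent-sentence token ids
masked = torch.randint(0, vocab_size, (2, 12))    # toy masked-sentence token ids
targets = torch.randint(0, vocab_size, (2, 12))   # original tokens at masked positions
logits = model(adjacent, masked)
loss = nn.functional.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
loss.backward()  # gradients flow into the sentence encoder through the MLM loss
```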