This paper presents a novel training method, Conditional Masked Language Modeling (CMLM), to effectively learn sentence representations on large-scale unlabeled corpora. CMLM integrates sentence representation learning into MLM training by conditioning on the encoded vectors of adjacent sentences. Our English CMLM model achieves state-of-the-art performance on SentEval, even outperforming models learned using (semi-)supervised signals. As a fully unsupervised learning method, CMLM can be conveniently extended to a broad range of languages and domains. We find that a multilingual CMLM model co-trained with bitext retrieval~(BR) and natural language inference~(NLI) tasks outperforms the previous state-of-the-art multilingual models by a large margin. We explore the same-language bias of the learned representations and propose a principal-component-based approach to remove the language-identifying information from the representations while still retaining sentence semantics.
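As a rough illustration of the principal-component-based removal step, the sketch below projects multilingual sentence embeddings away from their leading principal directions, under the assumption that language-identifying variation concentrates there. It is a minimal NumPy sketch, not the paper's exact procedure; the function name, the centering step, and the number of removed components are our assumptions.

```python
import numpy as np

def remove_language_components(embeddings, n_components=1):
    """Project out the top principal components of a matrix of sentence
    embeddings (shape: [num_sentences, dim]). The number of removed
    components is a hyperparameter; this sketch assumes the
    language-identifying signal lies in the leading directions."""
    # Center the embeddings so the principal directions are well defined.
    mean = embeddings.mean(axis=0, keepdims=True)
    centered = embeddings - mean

    # SVD of the centered matrix; rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top_dirs = vt[:n_components]  # (n_components, dim)

    # Subtract the projection onto the removed directions, restore the mean.
    projection = centered @ top_dirs.T @ top_dirs
    return centered - projection + mean
```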