Despite the progress of pre-trained language models, there is still no unified framework for sentence representation pre-training. As a result, different pre-training methods are required for specific scenarios, and the resulting models tend to be limited in universality and representation quality. In this work, we extend the recently proposed MAE-style pre-training strategy, RetroMAE, so that it can effectively support a wide variety of sentence representation tasks. The extended framework consists of two stages, with RetroMAE applied throughout the process. The first stage performs RetroMAE on generic corpora, such as Wikipedia and BookCorpus, from which the base model is learned. The second stage operates on domain-specific data, e.g., MS MARCO and NLI, where the base model is continually trained with RetroMAE and contrastive learning. The pre-training outputs of the two stages serve different applications, whose effectiveness is verified with comprehensive experiments. Concretely, the base model proves effective for zero-shot retrieval, achieving remarkable performance on the BEIR benchmark. The continually pre-trained models further benefit more downstream tasks, including domain-specific dense retrieval on MS MARCO and Natural Questions, as well as sentence embedding quality on standard STS and transfer tasks in SentEval. The empirical insights of this work may inspire future designs of sentence representation pre-training. Our pre-trained models and source code will be released to the public.
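As a rough illustration of the second-stage objective described above, the following minimal PyTorch sketch combines a generic MAE-style reconstruction loss with an in-batch contrastive (InfoNCE) loss over sentence embeddings. The function names, temperature, and loss weighting are illustrative assumptions and are not taken from the paper; the actual RetroMAE encoder/decoder losses are computed by the released implementation.

```python
import torch
import torch.nn.functional as F


def infonce_loss(query_emb: torch.Tensor, pos_emb: torch.Tensor,
                 temperature: float = 0.05) -> torch.Tensor:
    """In-batch-negative contrastive loss over L2-normalized sentence embeddings.
    (Hypothetical illustration; the temperature is an assumed hyper-parameter.)"""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.t() / temperature                 # [B, B] similarity matrix
    labels = torch.arange(q.size(0), device=q.device)  # diagonal = positive pairs
    return F.cross_entropy(logits, labels)


def stage2_loss(retromae_recon_loss: torch.Tensor,
                query_emb: torch.Tensor, pos_emb: torch.Tensor,
                alpha: float = 1.0) -> torch.Tensor:
    """Joint second-stage objective: RetroMAE reconstruction loss plus the
    contrastive term. The weighting `alpha` is an assumption for illustration."""
    return retromae_recon_loss + alpha * infonce_loss(query_emb, pos_emb)


if __name__ == "__main__":
    # Toy usage with random tensors standing in for encoder outputs.
    B, d = 8, 768
    recon = torch.tensor(2.3)                        # placeholder RetroMAE decoder loss
    q, p = torch.randn(B, d), torch.randn(B, d)
    print(stage2_loss(recon, q, p).item())
```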