Pre-trained models have demonstrated strong performance on many important tasks. However, designing effective pre-training strategies that improve models' usability for dense retrieval remains an open problem. In this paper, we propose RetroMAE, a novel pre-training framework for dense retrieval based on the Masked Auto-Encoder (MAE). Our framework is highlighted by the following critical designs: 1) an MAE-based pre-training workflow, where the input sentence is corrupted with different masks on the encoder and decoder sides, and the original sentence is reconstructed from the sentence embedding together with the masked sentence; 2) asymmetric model architectures, with a large, expressive transformer for sentence encoding and an extremely simplified transformer for sentence reconstruction; 3) asymmetric masking ratios, with moderate masking on the encoder side (15%) and aggressive masking on the decoder side (50-90%). We pre-train a BERT-like encoder on English Wikipedia and BookCorpus, and it notably outperforms existing pre-trained models on a wide range of dense retrieval benchmarks, such as MS MARCO, open-domain question answering, and BEIR.
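The asymmetric masking design (a moderate mask for the encoder input, an aggressive one for the decoder input, drawn independently over the same sentence) can be illustrated with a minimal sketch. The function name `dual_mask`, the default ratios, and the use of `103` (BERT's `[MASK]` token id) are illustrative assumptions, not the paper's actual implementation:

```python
import random

def dual_mask(token_ids, enc_ratio=0.15, dec_ratio=0.7, mask_id=103, seed=None):
    """Sketch of asymmetric masking: the same sentence is corrupted twice
    with independent random masks -- moderate on the encoder side,
    aggressive on the decoder side (ratios are hypothetical defaults)."""
    rng = random.Random(seed)
    enc_input = [mask_id if rng.random() < enc_ratio else t for t in token_ids]
    dec_input = [mask_id if rng.random() < dec_ratio else t for t in token_ids]
    return enc_input, dec_input

# A toy 12-token "sentence"; in practice these would be tokenizer outputs.
tokens = list(range(1000, 1012))
enc_in, dec_in = dual_mask(tokens, seed=0)
```

During pre-training, the encoder produces a sentence embedding from `enc_in`, and the lightweight decoder reconstructs the original tokens from that embedding plus the heavily masked `dec_in`, which forces most of the reconstruction signal through the embedding.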