Dense retrieval has shown promising results in many information retrieval (IR) related tasks, and its foundation is high-quality text representation learning for effective search. Some recent studies have shown that autoencoder-based language models with a weak decoder are able to boost dense retrieval performance. However, we argue that 1) it is not discriminative to decode all the input texts, and 2) even a weak decoder has a bypass effect on the encoder. Therefore, in this work, we introduce a novel contrastive span prediction task to pre-train the encoder alone while still retaining the bottleneck ability of the autoencoder. The key idea is to force the encoder to generate a text representation that is close to its own random spans and far away from others, using a group-wise contrastive loss. In this way, we can 1) learn discriminative text representations efficiently with group-wise contrastive learning over spans, and 2) avoid the bypass effect of the decoder thoroughly. Comprehensive experiments on publicly available retrieval benchmark datasets show that our approach significantly outperforms existing pre-training methods for dense retrieval.
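To make the group-wise contrastive objective concrete, the following PyTorch-style sketch illustrates one way such a loss over spans could be implemented. The function name, tensor shapes, temperature, and batching scheme are illustrative assumptions, not the paper's actual implementation.

\begin{verbatim}
import torch
import torch.nn.functional as F

def group_wise_contrastive_loss(text_reps, span_reps, temperature=0.05):
    # Sketch only; names and shapes are assumptions, not the paper's code.
    # text_reps: (B, d)    one representation per input text
    # span_reps: (B, K, d) K random spans sampled from each text
    # Each text representation is pulled toward its own K span
    # representations (positives) and pushed away from the spans of the
    # other B - 1 texts in the batch (negatives).
    B, K, d = span_reps.shape
    text_reps = F.normalize(text_reps, dim=-1)            # (B, d)
    span_reps = F.normalize(span_reps, dim=-1)            # (B, K, d)

    # Similarity between every text and every span in the batch: (B, B*K)
    sims = text_reps @ span_reps.reshape(B * K, d).T / temperature

    # Mask marking which spans belong to which text (the "group")
    pos_mask = torch.zeros(B, B * K, dtype=torch.bool)
    for i in range(B):
        pos_mask[i, i * K:(i + 1) * K] = True

    # InfoNCE-style loss, averaged over each text's own spans
    log_probs = F.log_softmax(sims, dim=-1)               # (B, B*K)
    loss = -(log_probs[pos_mask].reshape(B, K)).mean()
    return loss
\end{verbatim}

In this sketch, the encoder is trained without any decoder: the only supervision comes from contrasting a text's representation against span representations within the batch, which is how the bypass effect of a decoder is avoided by construction.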