Dense passage retrieval aims to retrieve passages relevant to a query from a large corpus based on dense representations (i.e., vectors) of the query and the passages. Recent studies have explored improving pre-trained language models to boost dense retrieval performance. This paper proposes CoT-MAE (ConTextual Masked Auto-Encoder), a simple yet effective generative pre-training method for dense passage retrieval. CoT-MAE employs an asymmetric encoder-decoder architecture that learns to compress sentence semantics into a dense vector through self-supervised and context-supervised masked auto-encoding. Specifically, self-supervised masked auto-encoding learns to model the semantics of the tokens inside a text span, while context-supervised masked auto-encoding learns to model the semantic correlation between text spans. We conduct experiments on large-scale passage retrieval benchmarks and show considerable improvements over strong baselines, demonstrating the effectiveness of CoT-MAE. Our code is available at https://github.com/caskcsg/ir/tree/main/cotmae.
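To make the pre-training objective concrete, below is a minimal PyTorch sketch (not the authors' released code) of the two losses described above. It assumes BERT-base-style dimensions, that position 0 of each span holds a [CLS]-style token, and that masked-token labels use the conventional -100 ignore index; layer counts and shapes are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CoTMAESketch(nn.Module):
    """Asymmetric masked auto-encoder: deep encoder, shallow decoder (a sketch)."""

    def __init__(self, vocab=30522, d=768, heads=12, enc_layers=12, dec_layers=1):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, heads, batch_first=True), enc_layers)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, heads, batch_first=True), dec_layers)
        self.mlm_head = nn.Linear(d, vocab)
        self.loss = nn.CrossEntropyLoss(ignore_index=-100)

    def forward(self, span_a, labels_a, span_b, labels_b):
        # Self-supervised masked auto-encoding: encode masked span A and
        # predict its own masked tokens from the encoder outputs.
        h_a = self.encoder(self.embed(span_a))                    # (B, L, d)
        loss_enc = self.loss(self.mlm_head(h_a).transpose(1, 2), labels_a)

        # Context-supervised masked auto-encoding: the dense [CLS] vector of
        # span A is prepended to the embeddings of masked span B, and the
        # shallow decoder must reconstruct B's masked tokens. Because the
        # decoder is weak, the dense vector is forced to carry A's semantics.
        cls_a = h_a[:, :1, :]                                     # dense bottleneck vector
        dec_in = torch.cat([cls_a, self.embed(span_b)], dim=1)
        h_b = self.decoder(dec_in)[:, 1:, :]                      # drop the prepended slot
        loss_dec = self.loss(self.mlm_head(h_b).transpose(1, 2), labels_b)
        return loss_enc + loss_dec

model = CoTMAESketch()
span_a = torch.randint(0, 30522, (2, 128))        # masked token ids of span A
span_b = torch.randint(0, 30522, (2, 128))        # masked token ids of its neighbor B
labels_a = torch.full((2, 128), -100); labels_a[:, 5] = 42   # labels only at masked slots
labels_b = torch.full((2, 128), -100); labels_b[:, 7] = 99
loss = model(span_a, labels_a, span_b, labels_b)
loss.backward()
```

The asymmetry (deep encoder, single-layer decoder) is the key design choice: since the decoder alone cannot reconstruct the neighboring span, the reconstruction signal flows back through the single dense vector, which is exactly the representation later used for retrieval.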