Dense passage retrieval aims to retrieve passages relevant to a query from a large corpus based on dense representations (i.e., vectors) of the query and the passages. Recent studies have explored improving pre-trained language models to boost dense retrieval performance. This paper proposes CoT-MAE (ConTextual Masked Auto-Encoder), a simple yet effective generative pre-training method for dense passage retrieval. CoT-MAE employs an asymmetric encoder-decoder architecture that learns to compress sentence semantics into a dense vector through self-supervised and context-supervised masked auto-encoding. Specifically, self-supervised masked auto-encoding learns to model the semantics of the tokens within a text span, while context-supervised masked auto-encoding learns to model the semantic correlation between text spans. We conduct experiments on large-scale passage retrieval benchmarks and show considerable improvements over strong baselines, demonstrating the effectiveness of CoT-MAE.
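The following PyTorch sketch illustrates how the two objectives described above could be combined: a deep encoder compresses a masked span into a single dense vector, a shallow decoder then reconstructs a heavily masked neighboring span conditioned only on that vector, and the two masked-LM losses are summed. This is a minimal sketch under stated assumptions; the module names (`Encoder`, `ShallowDecoder`), layer counts, masking rates, and toy data are all illustrative, not the authors' released implementation or hyperparameters.

```python
# Minimal sketch of CoT-MAE-style pre-training (PyTorch). Dimensions, layer
# counts, masking rates, and module names are assumptions for illustration.
import torch
import torch.nn as nn

VOCAB, D, MASK_ID = 1000, 128, 1

class Encoder(nn.Module):
    """Deep encoder; the hidden state at position 0 serves as the [CLS] vector."""
    def __init__(self, layers=6):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, D)
        block = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.stack = nn.TransformerEncoder(block, num_layers=layers)

    def forward(self, ids):
        return self.stack(self.emb(ids))                  # (B, L, D)

class ShallowDecoder(nn.Module):
    """Weak single-layer decoder; it sees the context vector in place of its
    first token, so reconstruction forces semantics through that vector."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, D)
        block = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.layer = nn.TransformerEncoder(block, num_layers=1)
        self.lm_head = nn.Linear(D, VOCAB)

    def forward(self, ids, ctx_vec):
        e = self.emb(ids)
        h = torch.cat([ctx_vec.unsqueeze(1), e[:, 1:]], dim=1)
        return self.lm_head(self.layer(h))                # (B, L, VOCAB)

def mask_tokens(ids, rate):
    """Randomly replace tokens with [MASK]; unmasked positions get label -100."""
    labels = ids.clone()
    m = torch.rand(ids.shape) < rate
    labels[~m] = -100
    return ids.masked_fill(m, MASK_ID), labels

# Toy batch: span_a and span_b stand in for neighboring spans of one document.
B, L = 2, 16
span_a = torch.randint(2, VOCAB, (B, L))
span_b = torch.randint(2, VOCAB, (B, L))

encoder, decoder = Encoder(), ShallowDecoder()
mlm_head = nn.Linear(D, VOCAB)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

# Self-supervised masked auto-encoding: plain MLM on span A in the encoder.
a_in, a_labels = mask_tokens(span_a, rate=0.30)
h_a = encoder(a_in)
loss_self = loss_fn(mlm_head(h_a).reshape(-1, VOCAB), a_labels.reshape(-1))

# Context-supervised masked auto-encoding: the shallow decoder reconstructs a
# masked span B, conditioned only on span A's dense [CLS] vector h_a[:, 0].
b_in, b_labels = mask_tokens(span_b, rate=0.45)
b_labels[:, 0] = -100          # position 0 carries the context vector, not a token
logits_b = decoder(b_in, ctx_vec=h_a[:, 0])
loss_ctx = loss_fn(logits_b.reshape(-1, VOCAB), b_labels.reshape(-1))

(loss_self + loss_ctx).backward()
print(f"self-sup loss {loss_self.item():.3f} | context-sup loss {loss_ctx.item():.3f}")
```

The asymmetry is the point of the design: because the decoder is too weak to reconstruct the heavily masked neighbor span on its own, the training signal pushes the encoder to pack span semantics into the single dense vector, which is exactly the representation later used for retrieval.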