Recently, dense passage retrieval has become a mainstream approach to finding relevant information in various natural language processing tasks. A number of studies have been devoted to improving the widely adopted dual-encoder architecture. However, most previous studies consider only the query-centric similarity relation when learning the dual-encoder retriever. To capture more comprehensive similarity relations, we propose a novel approach that leverages both query-centric and PAssage-centric sImilarity Relations (called PAIR) for dense passage retrieval. To implement our approach, we make three major technical contributions: introducing formal formulations of the two kinds of similarity relations, generating high-quality pseudo-labeled data via knowledge distillation, and designing an effective two-stage training procedure that incorporates a passage-centric similarity relation constraint. Extensive experiments show that our approach significantly outperforms previous state-of-the-art models on both the MSMARCO and Natural Questions datasets.
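To make the two kinds of similarity relations concrete, the following is a minimal sketch of how they could be turned into training losses. The abstract does not give the formulas, so everything here is an assumption made for illustration: dot-product similarity, a softmax negative log-likelihood for each relation, a weighted sum controlled by a hypothetical `alpha`, and all function names. The query-centric relation asks that a query score its positive passage above a negative passage; the passage-centric relation asks that the positive passage be closer to the query than to the negative passage.

```python
import math


def dot(u, v):
    """Dot-product similarity between two embedding vectors (an assumed choice)."""
    return sum(a * b for a, b in zip(u, v))


def nll_of_positive(pos_score, neg_scores):
    """Softmax negative log-likelihood of the positive score against the negatives."""
    scores = [pos_score] + list(neg_scores)
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - pos_score


def query_centric_loss(q, p_pos, p_negs):
    # Anchor is the query: s(q, p+) should exceed s(q, p-).
    return nll_of_positive(dot(q, p_pos), [dot(q, p) for p in p_negs])


def passage_centric_loss(q, p_pos, p_negs):
    # Anchor is the positive passage: s(p+, q) should exceed s(p+, p-).
    return nll_of_positive(dot(p_pos, q), [dot(p_pos, p) for p in p_negs])


def combined_loss(q, p_pos, p_negs, alpha=0.1):
    # alpha is a hypothetical mixing weight, not a value taken from the source.
    return ((1 - alpha) * query_centric_loss(q, p_pos, p_negs)
            + alpha * passage_centric_loss(q, p_pos, p_negs))


# Toy embeddings: the negative is far from the query but close to the positive passage.
q = [1.0, 0.0]
p_pos = [1.0, 1.0]
p_negs = [[0.0, 2.0]]

loss_q = query_centric_loss(q, p_pos, p_negs)
loss_p = passage_centric_loss(q, p_pos, p_negs)
loss = combined_loss(q, p_pos, p_negs)
```

In this toy case the query-centric loss is small because the query already ranks the positive above the negative, while the passage-centric loss is larger because the negative passage sits closer to the positive passage than the query does. That is the kind of signal a purely query-centric objective would not see.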