In open-domain question answering, dense passage retrieval has become a new paradigm for retrieving relevant passages from which answers are found. Typically, a dual-encoder architecture is adopted to learn dense representations of questions and passages for matching. However, it is difficult to train an effective dual-encoder because of several challenges: the discrepancy between training and inference, the existence of unlabeled positives, and limited training data. To address these challenges, we propose an optimized training approach, called RocketQA, to improve dense passage retrieval. We make three major technical contributions in RocketQA, namely cross-batch negatives, denoised negative sampling, and data augmentation. Extensive experiments show that RocketQA significantly outperforms previous state-of-the-art models on both MSMARCO and Natural Questions. In addition, building upon RocketQA, we achieve first place on the leaderboard of the MSMARCO Passage Ranking Task.
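To make the dual-encoder matching and the idea of cross-batch negatives concrete, the following is a minimal sketch, not the RocketQA implementation. It assumes that question and passage embeddings are already computed, that matching uses a dot product with a softmax loss over the candidate passages, and that "cross-batch" means each question also scores against the passages held by other devices. Devices are simulated here as a list of NumPy arrays, and all function names are hypothetical.

```python
import numpy as np


def softmax_nll(scores, positive_idx):
    """Mean negative log-likelihood of each question's positive passage.

    scores: (num_questions, num_passages) similarity matrix.
    positive_idx: (num_questions,) column index of each positive passage.
    """
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    rows = np.arange(scores.shape[0])
    return float(-log_probs[rows, positive_idx].mean())


def in_batch_loss(q, p):
    """Loss on one device: q[i] is paired with p[i]; other rows of p are negatives.

    q, p: (B, d) question and passage embeddings (assumed precomputed).
    """
    scores = q @ p.T  # (B, B) dot-product similarities
    return softmax_nll(scores, np.arange(q.shape[0]))


def cross_batch_loss(q_per_device, p_per_device):
    """Loss when every question also sees passages from all other devices.

    q_per_device, p_per_device: lists of (B, d) arrays, one per simulated device.
    """
    all_p = np.concatenate(p_per_device, axis=0)  # (A*B, d) shared passage pool
    losses = []
    offset = 0
    for q, p in zip(q_per_device, p_per_device):
        scores = q @ all_p.T  # (B, A*B): B questions against all pooled passages
        positives = offset + np.arange(q.shape[0])  # positions of this device's positives
        losses.append(softmax_nll(scores, positives))
        offset += p.shape[0]
    return float(np.mean(losses))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_devices, batch, dim = 4, 8, 16  # illustrative sizes only
    qs = [rng.normal(size=(batch, dim)) for _ in range(num_devices)]
    ps = [rng.normal(size=(batch, dim)) for _ in range(num_devices)]
    print("negatives per question, in-batch:", batch - 1)
    print("negatives per question, cross-batch:", num_devices * batch - 1)
    print("in-batch loss (device 0):", in_batch_loss(qs[0], ps[0]))
    print("cross-batch loss:", cross_batch_loss(qs, ps))
```

Under these assumptions, a question on one of A devices with per-device batch size B is contrasted against A×B − 1 negatives instead of B − 1, which brings the number of candidates seen in training closer to the large collection searched at inference time.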