Weakly Supervised Pre-Training for Multi-Hop Retriever

In multi-hop QA, answering a complex question entails iterative document retrieval to find the entity missing from the question. The main steps of this process are sub-question detection, document retrieval for the sub-question, and generation of a new query for the final document retrieval. However, building a dataset that contains complex questions with sub-questions and their corresponding documents requires costly human annotation. To address this issue, we propose a new method for weakly supervised multi-hop retriever pre-training that requires no human annotation. Our method includes 1) a pre-training task for generating vector representations of complex questions, 2) a scalable data generation method that produces the nested structure of question and sub-question as weak supervision for pre-training, and 3) a pre-training model structure based on dense encoders. We conduct experiments to compare the performance of our pre-trained retriever with several state-of-the-art models on end-to-end multi-hop QA as well as document retrieval. The experimental results show that our pre-trained retriever is effective and remains robust when data and computational resources are limited.
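
The retrieval loop described in the abstract (retrieve for the first hop, form a new query, retrieve again) can be illustrated with a minimal sketch. This is not the paper's model: the `encode` function is a toy bag-of-words stand-in for a trained dense encoder, forming the new query by concatenating the question with the first-hop document is an assumption made for illustration, and sub-question detection is omitted. All names (`encode`, `score`, `retrieve`, `two_hop_retrieve`, `CORPUS`) are hypothetical.

```python
import math
import re
from collections import Counter


def encode(text):
    """Toy stand-in for a dense encoder: an L2-normalized bag-of-words vector."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()}


def score(query_vec, doc_vec):
    """Inner product of two sparse vectors (cosine similarity, since both are normalized)."""
    return sum(w * doc_vec.get(tok, 0.0) for tok, w in query_vec.items())


def retrieve(query, corpus, exclude=()):
    """Return the highest-scoring document that is not in `exclude`."""
    query_vec = encode(query)
    candidates = [doc for doc in corpus if doc not in exclude]
    return max(candidates, key=lambda doc: score(query_vec, encode(doc)))


def two_hop_retrieve(question, corpus):
    """Retrieve a first-hop document, then build a new query for the final hop."""
    first = retrieve(question, corpus)
    # Assumption: the new query is the question concatenated with the first-hop document.
    new_query = question + " " + first
    second = retrieve(new_query, corpus, exclude=(first,))
    return first, second


# Hypothetical three-document corpus for illustration only.
CORPUS = [
    "Christopher Nolan is the director of Inception.",
    "Christopher Nolan was born in London.",
    "Paris lies on the Seine.",
]

hops = two_hop_retrieve("Where was the director of Inception born?", CORPUS)
for hop_number, doc in enumerate(hops, start=1):
    print(f"hop {hop_number}: {doc}")
```

In this toy corpus the first hop surfaces the document naming the director, and the expanded query then matches the document about that person. A trained dense encoder would replace `encode`, and an approximate nearest-neighbor index would typically replace the linear scan in `retrieve`.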
