Knowledge-intensive language tasks (KILT) usually require a large body of information to provide correct answers. A popular paradigm for solving such tasks is to combine a search system with a machine reader, where the former retrieves supporting evidence and the latter examines it to produce answers. Recently, the reader component has witnessed significant advances with the help of large-scale pre-trained generative models. Meanwhile, most existing solutions for the search component rely on the traditional ``index-retrieve-then-rank'' pipeline, which suffers from a large memory footprint and difficulty in end-to-end optimization. Inspired by recent efforts in constructing model-based IR systems, we propose to replace the traditional multi-step search pipeline with a novel single-step generative model, which can dramatically simplify the search process and be optimized in an end-to-end manner. We show that a strong generative retrieval model can be learned with a set of adequately designed pre-training tasks, and adopted to improve a variety of downstream KILT tasks with further fine-tuning. We name the pre-trained generative retrieval model CorpusBrain, as all information about the corpus is encoded in its parameters without the need to construct an additional index. Empirical results show that CorpusBrain significantly outperforms strong baselines on the retrieval task of the KILT benchmark and establishes new state-of-the-art downstream performance. We also show that CorpusBrain works well under zero- and low-resource settings.
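To make the single-step generative retrieval paradigm concrete, the sketch below shows the inference-time interface such a model exposes: a query goes into a seq2seq model, and beam search directly generates document identifiers (e.g., Wikipedia page titles), with no separate index or ranking stage. This is an illustrative assumption-laden sketch, not the released CorpusBrain implementation; the checkpoint name, the choice of page titles as identifiers, and the example query are placeholders.

```python
# Minimal sketch of single-step generative retrieval (illustrative only;
# not the official CorpusBrain release). Assumes a BART-style seq2seq model
# fine-tuned to map a query directly to document identifiers such as
# Wikipedia page titles, so no external index is needed at inference time.
from transformers import BartTokenizer, BartForConditionalGeneration

# Placeholder checkpoint; in practice this would be a retrieval-tuned model.
MODEL_NAME = "facebook/bart-large"

tokenizer = BartTokenizer.from_pretrained(MODEL_NAME)
model = BartForConditionalGeneration.from_pretrained(MODEL_NAME)

def retrieve(query: str, k: int = 5) -> list[str]:
    """Generate the top-k document identifiers for a query via beam search."""
    inputs = tokenizer(query, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        num_beams=k,
        num_return_sequences=k,
        max_length=32,
    )
    # Each decoded sequence is treated as one retrieved document identifier.
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

print(retrieve("Who wrote the novel Dracula?"))
```

Because the corpus is encoded in the model parameters, retrieval reduces to a single beam-search pass, which is what enables the end-to-end optimization the abstract highlights. In practice, generative retrievers of this family typically constrain decoding (e.g., with a prefix trie over valid identifiers) so that every generated sequence is guaranteed to be a real document identifier.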