Knowledge-intensive language tasks (KILT) usually require a large body of information to provide correct answers. A popular paradigm for solving such tasks is to combine a search system with a machine reader, where the former retrieves supporting evidence and the latter examines it to produce answers. Recently, the reader component has witnessed significant advances with the help of large-scale pre-trained generative models. Meanwhile, most existing solutions for the search component rely on the traditional ``index-retrieve-then-rank'' pipeline, which suffers from a large memory footprint and difficulty in end-to-end optimization. Inspired by recent efforts in constructing model-based IR systems, we propose to replace the traditional multi-step search pipeline with a novel single-step generative model, which can dramatically simplify the search process and be optimized in an end-to-end manner. We show that a strong generative retrieval model can be learned with a set of adequately designed pre-training tasks, and adopted to improve a variety of downstream KILT tasks with further fine-tuning. We name the pre-trained generative retrieval model CorpusBrain, as all information about the corpus is encoded in its parameters without the need to construct an additional index. Empirical results show that CorpusBrain significantly outperforms strong baselines on the retrieval task of the KILT benchmark and establishes new state-of-the-art downstream performance. We also show that CorpusBrain works well under zero- and low-resource settings.
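To make the single-step generative retrieval paradigm concrete, the sketch below shows the inference-time interface such a model exposes: a query goes into a seq2seq model, and beam search directly generates document identifiers (e.g., Wikipedia page titles), with no separate index or ranking stage. This is an illustrative assumption-laden sketch, not the released CorpusBrain implementation; the checkpoint name, the choice of page titles as identifiers, and the example query are placeholders.

```python
# Minimal sketch of single-step generative retrieval (illustrative only;
# not the official CorpusBrain release). Assumes a BART-style seq2seq model
# fine-tuned to map a query directly to document identifiers such as
# Wikipedia page titles, so no external index is needed at inference time.
from transformers import BartTokenizer, BartForConditionalGeneration

# Placeholder checkpoint; in practice this would be a retrieval-tuned model.
MODEL_NAME = "facebook/bart-large"

tokenizer = BartTokenizer.from_pretrained(MODEL_NAME)
model = BartForConditionalGeneration.from_pretrained(MODEL_NAME)

def retrieve(query: str, k: int = 5) -> list[str]:
    """Generate the top-k document identifiers for a query via beam search."""
    inputs = tokenizer(query, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        num_beams=k,
        num_return_sequences=k,
        max_length=32,
    )
    # Each decoded sequence is treated as one retrieved document identifier.
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

print(retrieve("Who wrote the novel Dracula?"))
```

Because the corpus is encoded in the model parameters, retrieval reduces to a single beam-search pass, which is what enables the end-to-end optimization the abstract highlights. In practice, generative retrievers of this family typically constrain decoding (e.g., with a prefix trie over valid identifiers) so that every generated sequence is guaranteed to be a real document identifier.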