Information retrieval (IR) is essential to search engines and dialogue systems, as well as to natural language processing tasks such as open-domain question answering. IR plays an important role in the biomedical domain, where the content and sources of scientific knowledge may evolve rapidly. Although neural retrievers have surpassed traditional IR approaches such as TF-IDF and BM25 on standard open-domain question answering tasks, they still fall short in the biomedical domain. In this paper, we seek to improve IR with neural retrievers in the biomedical domain using a three-pronged approach. First, to address the relative lack of data in the biomedical domain, we propose a template-based question generation method that can be used to train neural retriever models. Second, we develop two novel pre-training tasks that are closely aligned with the downstream task of information retrieval. Third, we introduce the ``Poly-DPR'' model, which encodes each context into multiple context vectors. Extensive experiments and analysis on the BioASQ challenge suggest that our proposed method leads to large gains over existing neural approaches and outperforms BM25 in the small-corpus setting. We also show that BM25 and our method complement each other, and that a simple hybrid model leads to further gains in the large-corpus setting.
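To make the multi-vector idea concrete, the following is a minimal sketch of how a retriever might score a query against contexts that are each represented by several vectors, and how a neural score might be combined with a BM25 score. The abstract does not specify either step, so the maximum over context vectors in `poly_score` and the weighted sum in `hybrid_score` are assumptions made for illustration, as are all function names, the `weight` parameter, and the toy vectors; none of this should be read as the Poly-DPR implementation.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def poly_score(query_vec: Vector, context_vecs: Sequence[Vector]) -> float:
    """Score a context that is encoded as several vectors.

    Assumption: the context score is the best (maximum) inner product
    between the query vector and any of the context's vectors.
    """
    return max(dot(query_vec, c) for c in context_vecs)


def hybrid_score(bm25: float, neural: float, weight: float = 0.5) -> float:
    """Combine a lexical and a neural score.

    Assumption: a simple weighted sum; `weight` is a hypothetical
    hyperparameter that would need tuning on held-out data.
    """
    return bm25 + weight * neural


def rank(query_vec: Vector,
         contexts: Dict[str, List[Vector]]) -> List[Tuple[str, float]]:
    """Return (context id, score) pairs sorted from best to worst."""
    scored = [(cid, poly_score(query_vec, vecs)) for cid, vecs in contexts.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy data: context "a" has two vectors, context "b" has one.
    contexts = {"a": [[0.2, 0.9], [0.8, 0.1]], "b": [[0.5, 0.5]]}
    print(rank([1.0, 0.0], contexts))
```

In this toy example, context `a` ranks above `b` because one of its two vectors aligns closely with the query, which is the intuition behind giving each context more than one representation.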