Generation-Augmented Retrieval for Open-domain Question Answering

From arXiv. Added experiments with a generative reader. Current performance: EM=41.8 (43.8 with DPR) on NQ and 62.7 on TriviaQA with BERT-base (extractive reader); EM=38.1 (45.3 with DPR) on NQ and 61.8 on TriviaQA with BART-large (generative reader).

Conventional sparse retrieval methods such as TF-IDF and BM25 are simple and efficient, but solely rely on lexical overlap without semantic matching. Recent dense retrieval methods learn latent representations to tackle the lexical mismatch problem, while being more computationally expensive and insufficient for exact matching as they embed the text sequence into a single vector with limited capacity. In this paper, we present Generation-Augmented Retrieval (GAR), a query expansion method that augments a query with relevant contexts through text generation. We demonstrate on open-domain question answering that the generated contexts significantly enrich the semantics of the queries and thus GAR with sparse representations (BM25) achieves comparable or better performance than the state-of-the-art dense methods such as DPR \cite{karpukhin2020dense}. We show that generating various contexts of a query is beneficial as fusing their results consistently yields better retrieval accuracy. Moreover, as sparse and dense representations are often complementary, GAR can be easily combined with DPR to achieve even better performance. Furthermore, GAR achieves the state-of-the-art performance on the Natural Questions and TriviaQA datasets under the extractive setting when equipped with an extractive reader, and consistently outperforms other retrieval methods when the same generative reader is used.
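
The pipeline described in the abstract (expand the query with generated contexts, run BM25 for each expanded query, then fuse the ranked lists) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: `generate_contexts` is a hypothetical stub that returns hand-written strings in place of a trained generator, the `BM25` class is a small textbook implementation rather than the index used in the paper, and `fuse` uses simple round-robin interleaving, which may differ from the authors' fusion strategy.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphanumeric runs only (toy tokenizer, an assumption).
    return re.findall(r"\w+", text.lower())


class BM25:
    """Small textbook BM25 scorer over an in-memory list of passages."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.tf = [Counter(d) for d in self.docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        n = len(self.docs)
        df = Counter(t for d in self.docs for t in set(d))
        self.idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5)) for t, c in df.items()}

    def score(self, query, i):
        tf, dl = self.tf[i], len(self.docs[i])
        norm = self.k1 * (1 - self.b + self.b * dl / self.avgdl)
        total = 0.0
        for t in tokenize(query):
            f = tf.get(t, 0)
            if f:
                total += self.idf[t] * f * (self.k1 + 1) / (f + norm)
        return total

    def rank(self, query, k):
        order = sorted(range(len(self.docs)), key=lambda i: (-self.score(query, i), i))
        return order[:k]


def generate_contexts(query):
    # HYPOTHETICAL stub: a real system would call a trained seq2seq generator here.
    # The strings below are hand-written stand-ins for generated contexts.
    return [
        "Gustave Eiffel",
        "The Eiffel Tower was built by the company of Gustave Eiffel",
        "Eiffel Tower",
    ]


def fuse(rankings, k):
    # Round-robin interleaving with de-duplication (an assumed fusion rule).
    fused, seen = [], set()
    for pos in range(max(len(r) for r in rankings)):
        for r in rankings:
            if pos < len(r) and r[pos] not in seen:
                seen.add(r[pos])
                fused.append(r[pos])
    return fused[:k]


docs = [
    "Paris is the capital and most populous city of France.",
    "The Eiffel Tower was designed by the engineering firm of Gustave Eiffel.",
    "Mount Everest is the highest mountain above sea level.",
    "Gustave Eiffel was a French civil engineer.",
]
query = "who built the tall iron landmark in paris"

index = BM25(docs)
plain = index.rank(query, k=3)  # BM25 on the raw query only
expanded = [index.rank(query + " " + c, k=3) for c in generate_contexts(query)]
fused = fuse(expanded, k=3)     # fused result over all expanded queries

print("plain:", plain)
print("fused:", fused)
```

On this toy corpus the raw query overlaps with the passages mainly through the word "paris", so plain BM25 puts the Paris passage first. Once the stand-in generated contexts are appended, the Eiffel Tower passage shares enough terms with each expanded query to lead every ranking, and therefore the fused list as well. This mirrors, at a very small scale, the lexical-mismatch argument made in the abstract.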
