Query expansion is an effective approach for mitigating vocabulary mismatch between queries and documents in information retrieval. One recent line of research uses language models to generate query-related contexts for expansion. Along this line, we argue that expansion terms from these contexts should balance two key aspects: diversity and relevance. The obvious way to increase diversity is to sample multiple contexts from the language model. However, this comes at the cost of relevance, because models have a well-known tendency to hallucinate incorrect or irrelevant contexts. To balance these two considerations, we propose combining an effective filtering strategy with a fusion of the retrieved documents based on the generation probability of each context. Our lexical-matching-based approach achieves top-5/top-20 retrieval accuracy similar to, and top-100 accuracy higher than, the well-established dense retrieval model DPR, while reducing the index size by more than 96%. For end-to-end QA, the reader model also benefits from our method and achieves the highest Exact-Match score against several competitive baselines.
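To make the filtering-and-fusion idea concrete, the following is a minimal Python sketch of one way such a pipeline could be organized, not the authors' implementation. The helpers `generate_contexts`, `passes_filter`, and `lexical_search` are hypothetical placeholders for the language-model sampler, the relevance filter, and a BM25-style lexical retriever; the exact filtering criterion and fusion formula used in the paper may differ.

```python
# Sketch: sample contexts, filter likely hallucinations, expand the query,
# retrieve with lexical matching, and fuse scores weighted by generation probability.
# All helper functions are assumed/hypothetical interfaces, not from the paper.
import math
from collections import defaultdict

def fused_retrieval(query, generate_contexts, passes_filter, lexical_search,
                    num_contexts=10, top_k=100):
    """Return top_k (doc_id, fused_score) pairs for the query."""
    fused_scores = defaultdict(float)
    # generate_contexts is assumed to yield (context_text, log_prob) pairs.
    for context, log_prob in generate_contexts(query, n=num_contexts):
        if not passes_filter(query, context):       # drop irrelevant/hallucinated contexts
            continue
        weight = math.exp(log_prob)                  # weight by generation probability
        expanded_query = f"{query} {context}"        # simple concatenation for expansion
        # lexical_search is assumed to return {doc_id: score} for the expanded query.
        for doc_id, score in lexical_search(expanded_query, top_k=top_k).items():
            fused_scores[doc_id] += weight * score   # probability-weighted score fusion
    return sorted(fused_scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

In this sketch, contexts that survive the filter each run an independent lexical retrieval, and their document scores are summed with weights proportional to each context's generation probability, so more confidently generated contexts contribute more to the final ranking.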