Doc2Query -- the process of expanding the content of a document before indexing using a sequence-to-sequence model -- has emerged as a prominent technique for improving the first-stage retrieval effectiveness of search engines. However, sequence-to-sequence models are known to be prone to "hallucinating" content that is not present in the source text. We argue that Doc2Query is indeed prone to hallucination, which ultimately harms retrieval effectiveness and inflates the index size. In this work, we explore techniques for filtering out these harmful queries prior to indexing. We find that using a relevance model to remove poor-quality queries can improve the retrieval effectiveness of Doc2Query by up to 16%, while simultaneously reducing mean query execution time by 23% and cutting the index size by 33%. We release the code, data, and a live demonstration to facilitate reproduction and further exploration at https://github.com/terrierteam/pyterrier_doc2query.
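The filtering idea described above can be illustrated with a minimal sketch. It is not the released implementation: the term-overlap scorer is a toy stand-in for a trained relevance model, and the names `overlap_score`, `filter_queries`, `expand_document`, and the `threshold` value of 0.5 are all hypothetical choices made for illustration.

```python
from typing import Callable, List


def overlap_score(query: str, document: str) -> float:
    """Toy stand-in for a relevance model (an assumption, not the paper's model):
    the fraction of query terms that also occur in the document."""
    q_terms = query.lower().split()
    if not q_terms:
        return 0.0
    d_terms = set(document.lower().split())
    return sum(t in d_terms for t in q_terms) / len(q_terms)


def filter_queries(
    document: str,
    queries: List[str],
    score_fn: Callable[[str, str], float] = overlap_score,
    threshold: float = 0.5,  # hypothetical cut-off, chosen only for this example
) -> List[str]:
    """Keep only the generated queries whose score meets the threshold."""
    return [q for q in queries if score_fn(q, document) >= threshold]


def expand_document(document: str, queries: List[str]) -> str:
    """Append the surviving queries to the document text before indexing."""
    return " ".join([document, *filter_queries(document, queries)])


doc = "the terrier is a small dog breed originally bred to hunt vermin"
generated = [
    "what is a terrier",
    "terrier dog breed",
    "how tall is the eiffel tower",  # hallucinated: unrelated to the document
]
kept = filter_queries(doc, generated)
expanded = expand_document(doc, generated)
```

Here the unrelated third query scores below the threshold and is dropped, so it never reaches the index. In practice, `score_fn` would be replaced by a learned relevance model, and the threshold would be tuned on held-out data.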