Query Expansion (QE) enriches queries and Document Expansion (DE) enriches documents, but the two techniques are usually applied separately. Applied in isolation, either one can introduce semantic misalignment between the expanded queries (or documents) and their relevant documents (or queries). To address this issue, we propose TCDE, a dual expansion strategy that uses large language models (LLMs) for topic-centric enrichment of both queries and documents. TCDE uses two distinct prompt templates, one for queries and one for documents. On the query side, an LLM is guided to identify the distinct sub-topics within each query and to generate a focused pseudo-document for each sub-topic. On the document side, an LLM is guided to distill each document into a set of core topic sentences. The resulting outputs are used to expand the original query and document. This topic-centric dual expansion establishes semantic bridges between queries and their relevant documents, enabling better alignment for downstream retrieval models. Experiments on two challenging benchmarks, TREC Deep Learning and BEIR, show that TCDE achieves substantial improvements over strong state-of-the-art expansion baselines. In particular, on dense retrieval tasks it outperforms several state-of-the-art methods, with a relative improvement of 2.8% in NDCG@10 on the SciFact dataset. These results validate the effectiveness of the topic-centric dual expansion strategy.