Training effective dense retrieval models typically relies on hard negative (HN) examples mined from large document corpora using methods such as BM25 or cross-encoders (CE), which require full corpus access. We propose a corpus-free alternative: an end-to-end pipeline where a Large Language Model (LLM) first generates a query from a passage and then produces a hard negative example using only the generated query text. Our dataset comprises 7,250 arXiv abstracts spanning diverse domains including mathematics, physics, computer science, and related fields, serving as positive passages for query generation. We evaluate two fine-tuning configurations of DistilBERT for dense retrieval: one using LLM-generated hard negatives conditioned solely on the query, and another using negatives generated with both the query and its positive document as context. Compared to traditional corpus-based mining methods (LLM Query $\rightarrow$ BM25 HN and LLM Query $\rightarrow$ CE HN) on multiple BEIR benchmark datasets, our all-LLM pipeline outperforms strong lexical mining baselines and achieves performance comparable to cross-encoder-based methods, demonstrating the potential of corpus-free hard negative generation for retrieval model training.
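A minimal sketch of the two-step, corpus-free pipeline described above, assuming a generic text-in/text-out LLM client passed in as a callable; the prompts and function names are illustrative, not the paper's exact implementation:

```python
# Corpus-free hard-negative pipeline (illustrative sketch):
#   step 1: an LLM generates a query from a positive passage;
#   step 2: the same LLM writes a hard negative conditioned only on the query.
from typing import Callable

LLMFn = Callable[[str], str]  # any text-in / text-out LLM client


def generate_query(passage: str, llm: LLMFn) -> str:
    """Ask the LLM for a search query that the passage would answer."""
    prompt = (
        "Write a single search query that the following passage answers.\n\n"
        f"Passage:\n{passage}\n\nQuery:"
    )
    return llm(prompt).strip()


def generate_hard_negative(query: str, llm: LLMFn) -> str:
    """Ask the LLM for a passage that looks relevant to the query but is not."""
    prompt = (
        "Write a short passage that appears topically related to the query "
        "below but does NOT actually answer it.\n\n"
        f"Query:\n{query}\n\nPassage:"
    )
    return llm(prompt).strip()


def build_training_triple(passage: str, llm: LLMFn) -> tuple[str, str, str]:
    """Return a (query, positive, hard-negative) triple; no corpus access needed."""
    query = generate_query(passage, llm)
    negative = generate_hard_negative(query, llm)
    return query, passage, negative
```

The resulting (query, positive, hard-negative) triples can then be used to fine-tune a DistilBERT dual encoder with a standard contrastive objective, in place of triples whose negatives were mined from the corpus with BM25 or a cross-encoder.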