Evaluating Hybrid Retrieval Augmented Generation using Dynamic Test Sets: LiveRAG Challenge

We present our submission to the LiveRAG Challenge 2025, which evaluates retrieval-augmented generation (RAG) systems on dynamic test sets using the FineWeb-10BT corpus. Our final hybrid approach combines sparse (BM25) and dense (E5) retrieval and generates answers with Falcon3-10B-Instruct, with the goal of producing relevant and faithful responses. Through systematic evaluation on 200 synthetic questions generated with DataMorgana across 64 unique question-user combinations, we show that neural re-ranking with RankLLaMA improves MAP from 0.523 to 0.797 (a 52% relative improvement) but at a prohibitive computational cost (84 s vs. 1.74 s per question). DSPy-optimized prompting strategies achieved higher semantic similarity (0.771 vs. 0.668), but their 0% refusal rates raised concerns about over-confidence and generalizability. Our submitted hybrid system, which omits re-ranking, placed 4th in faithfulness and 11th in correctness among 25 teams. Analysis across question categories shows that vocabulary alignment between questions and documents was the strongest predictor of performance on our development set: questions phrased similarly to the documents raised cosine similarity from 0.562 to 0.762.
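The trade-offs quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the figures stated in the abstract; the helper name `relative_change` and the variable names are illustrative assumptions, not part of the submitted system.

```python
def relative_change(before: float, after: float) -> float:
    """Relative change of `after` with respect to `before`, as a fraction."""
    return (after - before) / before


# Figures quoted in the abstract (names are illustrative only).
map_gain = relative_change(0.523, 0.797)         # MAP without vs. with RankLLaMA re-ranking
similarity_gain = relative_change(0.668, 0.771)  # semantic similarity, comparison system vs. DSPy
phrasing_gain = relative_change(0.562, 0.762)    # cosine similarity with document-similar phrasing
latency_factor = 84 / 1.74                       # seconds per question, with vs. without re-ranking

print(f"MAP relative gain:        {map_gain:.1%}")
print(f"Similarity relative gain: {similarity_gain:.1%}")
print(f"Phrasing relative gain:   {phrasing_gain:.1%}")
print(f"Re-ranking slowdown:      {latency_factor:.1f}x")
```

Read this way, the roughly 52% relative MAP gain from re-ranking comes with a per-question latency about 48 times higher, which is the cost the abstract describes as prohibitive.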
