In realistic retrieval settings with large and evolving knowledge bases, the total number of documents relevant to a query is typically unknown, so recall cannot be computed. In this paper, we evaluate several established strategies for handling this limitation by measuring the correlation between retrieval quality metrics and LLM-based judgments of response quality, where the responses are generated from the retrieved documents. We conduct experiments across multiple datasets with a relatively small number of relevant documents (2-15). We also introduce a simple retrieval quality measure that performs well without requiring knowledge of the total number of relevant documents.