As large language models (LLMs) advance, deep research systems can generate expert-level reports via multi-step reasoning and evidence-based synthesis, but evaluating such reports remains challenging. Existing benchmarks often lack systematic criteria for expert-level reporting; evaluations that rely heavily on LLM judges can miss issues that require expert judgment; and source verification typically covers only a limited subset of explicitly cited statements rather than report-wide factual reliability. We introduce DEER, a benchmark for evaluating expert-level deep research reports. DEER comprises 50 report-writing tasks spanning 13 domains and an expert-grounded evaluation taxonomy (7 dimensions, 25 sub-dimensions) operationalized into 130 fine-grained rubric items. DEER further provides task-specific expert guidance to help LLM judges assess expert-level report quality more consistently. Complementing the rubric-based assessment, we propose a document-level fact-checking architecture that extracts and verifies all claims across the entire report, both cited and uncited, and quantifies external-evidence quality. DEER correlates closely with human expert judgments and yields interpretable diagnostics of system strengths and weaknesses.
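To make the document-level fact-checking idea concrete, below is a minimal sketch of such a pipeline, assuming three stages: claim extraction over the full report, per-claim verification against external evidence, and report-level aggregation of support and evidence quality. The function names, the naive sentence-based extractor, the stub verifier, and the 0-to-1 evidence-quality scale are illustrative assumptions, not DEER's actual implementation.

```python
# Hypothetical sketch of a document-level fact-checking pipeline:
# extract every claim (cited or not), verify each against external evidence,
# and aggregate report-level factuality and evidence-quality scores.
from dataclasses import dataclass
import re


@dataclass
class Verdict:
    claim: str
    supported: bool           # did retrieved evidence support the claim?
    evidence_quality: float   # assumed 0..1 scale (e.g., source authority/recency)


def extract_claims(report: str) -> list[str]:
    """Stand-in claim extractor: naive sentence split.
    A real system would use an LLM to extract atomic, checkable claims."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", report) if s.strip()]


def verify_claim(claim: str) -> Verdict:
    """Stand-in verifier: a real pipeline would retrieve external evidence
    (e.g., web search) and have an LLM judge entailment and source quality."""
    return Verdict(claim=claim, supported=True, evidence_quality=0.5)  # dummy values


def fact_check_report(report: str) -> dict:
    """Run the pipeline over the whole report and aggregate the verdicts."""
    verdicts = [verify_claim(c) for c in extract_claims(report)]
    n = max(len(verdicts), 1)
    return {
        "num_claims": len(verdicts),
        "factual_support_rate": sum(v.supported for v in verdicts) / n,
        "mean_evidence_quality": sum(v.evidence_quality for v in verdicts) / n,
    }


if __name__ == "__main__":
    print(fact_check_report("Deep research systems draft long reports. Some claims cite sources."))
```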