Evaluating long-form responses to research queries relies heavily on expert annotators, restricting attention to areas like AI where researchers can conveniently enlist colleagues. Yet research expertise is abundant: survey articles consolidate knowledge spread across the literature. We introduce ResearchQA, a resource for evaluating LLM systems that distills survey articles from 75 research fields into 21K queries and 160K rubric items. Queries and rubrics are jointly derived from survey sections, with rubric items listing query-specific answer evaluation criteria, e.g., citing papers, providing explanations, and describing limitations. 31 Ph.D. annotators across 8 fields judge that 90% of queries reflect Ph.D.-level information needs and that 87% of rubric items warrant at least a sentence of emphasis in an answer. We use ResearchQA to evaluate 18 systems in 7.6K head-to-head comparisons. No parametric or retrieval-augmented system we evaluate exceeds 70% rubric coverage, and the highest-ranking system overall reaches 75%. Error analysis reveals that the highest-ranking system fully addresses fewer than 11% of citation rubric items, 48% of limitation items, and 49% of comparison items. We release our data to facilitate more comprehensive multi-field evaluations.
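To make the query/rubric pairing concrete, here is a minimal Python sketch of the kind of structure the abstract describes: a query paired with rubric items, and a coverage score computed as the fraction of items a system's answers are judged to address. The field names, classes, and scoring function are illustrative assumptions, not ResearchQA's released schema or official grader.

```python
from dataclasses import dataclass, field

# Hypothetical illustration of a query with query-specific rubric items
# (names and scoring are assumptions, not the paper's actual code).

@dataclass
class RubricItem:
    criterion: str          # e.g., "cites the relevant prior work on X"
    covered: bool = False   # judged per system response

@dataclass
class Query:
    field_name: str                        # one of the 75 research fields
    question: str                          # Ph.D.-level information need
    rubric: list[RubricItem] = field(default_factory=list)

def rubric_coverage(queries: list[Query]) -> float:
    """Fraction of rubric items a system's answers are judged to cover."""
    items = [item for q in queries for item in q.rubric]
    return sum(item.covered for item in items) / len(items) if items else 0.0

# Example: one query with three rubric items, two judged covered -> ~0.67
q = Query(
    field_name="Natural Language Processing",
    question="What are the main approaches to evaluating long-form answers?",
    rubric=[
        RubricItem("cites relevant prior benchmarks", covered=True),
        RubricItem("explains rubric-based scoring", covered=True),
        RubricItem("describes limitations of LLM judges", covered=False),
    ],
)
print(f"coverage = {rubric_coverage([q]):.2f}")
```

Under this sketch, the abstract's coverage numbers (e.g., 70% or 75%) correspond to the value returned by `rubric_coverage` aggregated over a system's answers to all queries.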