The rapid advancement of large language models (LLMs) and multimodal foundation models has sparked growing interest in their potential for scientific research. However, scientific intelligence spans a broad spectrum of abilities, from understanding fundamental knowledge to conducting creative discovery, and existing benchmarks remain fragmented: most focus on narrow tasks and fail to reflect the hierarchical, multi-disciplinary nature of real scientific inquiry. We introduce \textbf{HiSciBench}, a hierarchical benchmark designed to evaluate foundation models across five levels that mirror the complete scientific workflow: \textit{Scientific Literacy} (L1), \textit{Literature Parsing} (L2), \textit{Literature-based Question Answering} (L3), \textit{Literature Review Generation} (L4), and \textit{Scientific Discovery} (L5). HiSciBench contains 8,735 carefully curated instances spanning six major scientific disciplines (mathematics, physics, chemistry, biology, geography, and astronomy), and supports multimodal inputs, including text, equations, figures, and tables, as well as cross-lingual evaluation. Unlike prior benchmarks that assess isolated abilities, HiSciBench provides an integrated, dependency-aware framework that enables detailed diagnosis of model capabilities at each stage of scientific reasoning. Comprehensive evaluations of leading models, including GPT-5, DeepSeek-R1, and several multimodal systems, reveal substantial performance gaps: while models achieve up to 69\% accuracy on basic literacy tasks, performance declines sharply to 25\% on discovery-level challenges. HiSciBench establishes a new standard for evaluating scientific intelligence and offers actionable insights for developing models that are not only more capable but also more reliable. The benchmark will be publicly released to facilitate future research.