Critical appraisal of scientific literature is an essential skill in the biomedical field. While large language models (LLMs) offer promising support for this task, their reliability remains limited, particularly for critical reasoning in specialized domains. We introduce CareMedEval, an original dataset designed to evaluate LLMs on biomedical critical appraisal and reasoning tasks. Derived from authentic exams taken by French medical students, the dataset contains 534 questions based on 37 scientific articles. Unlike existing benchmarks, CareMedEval explicitly evaluates critical reading and reasoning grounded in scientific papers. Benchmarking state-of-the-art generalist and biomedical-specialized LLMs under various context conditions reveals the difficulty of the task: open and commercial models fail to exceed an Exact Match Rate of 0.5, even though generating intermediate reasoning tokens considerably improves performance. Models remain particularly challenged by questions about study limitations and statistical analysis. CareMedEval provides a challenging benchmark for grounded reasoning, exposing current LLM limitations and paving the way for future development of automated support for critical appraisal.
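The abstract reports results in terms of an Exact Match Rate. A minimal sketch of how such a metric could be computed, assuming each question is a multi-answer multiple-choice item scored as correct only when the predicted option set matches the gold option set exactly (the function name and data layout below are illustrative, not taken from the paper):

```python
def exact_match_rate(predictions, gold):
    """predictions, gold: lists of sets of selected option labels, one per question."""
    assert len(predictions) == len(gold)
    # A question counts as correct only if the predicted set equals the gold set exactly.
    hits = sum(1 for p, g in zip(predictions, gold) if set(p) == set(g))
    return hits / len(gold)

# Hypothetical example: 3 questions with options labelled "A".."E".
preds = [{"A", "C"}, {"B"}, {"A", "D", "E"}]
truth = [{"A", "C"}, {"B", "D"}, {"A", "D", "E"}]
print(exact_match_rate(preds, truth))  # 2/3 ≈ 0.67
```

Under this reading, partial credit is not awarded, which is consistent with a strict "exact match" criterion; the paper's actual scoring protocol may differ.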