Systematic reviews are crucial for synthesizing scientific evidence but remain labor-intensive, especially when extracting detailed methodological information. Large language models (LLMs) offer the potential to automate methodological assessments, promising to transform evidence synthesis. Here, using causal mediation analysis as a representative methodological domain, we benchmarked state-of-the-art LLMs against expert human reviewers across 180 full-text scientific articles. Model performance closely tracked human judgments (accuracy correlation 0.71; F1 correlation 0.97), achieving near-human accuracy on straightforward, explicitly stated methodological criteria. However, accuracy declined sharply on complex, inference-intensive assessments, lagging expert reviewers by up to 15%. Errors commonly resulted from reliance on superficial linguistic cues -- for instance, models frequently misinterpreted keywords such as "longitudinal" or "sensitivity" as automatic evidence of methodological rigor, leading to systematic misclassifications. Longer documents yielded lower model accuracy, whereas publication year showed no significant effect. Our findings highlight an important pattern for practitioners using LLMs for methods review and synthesis from full texts: current LLMs excel at identifying explicit methodological features but require human oversight for nuanced interpretations. Integrating automated information extraction with targeted expert review thus offers a promising approach to enhancing efficiency and methodological rigor in evidence synthesis across diverse scientific fields.