REACT-LLM: A Benchmark for Evaluating LLM Integration with Causal Features in Clinical Prognostic Tasks

Large Language Models (LLMs) and causal learning each hold strong potential for clinical decision making (CDM). However, their synergy remains poorly understood, largely because no systematic benchmark evaluates their integration in clinical risk prediction. In real-world healthcare, identifying features with a causal influence on outcomes is crucial for actionable and trustworthy predictions. While recent work highlights LLMs' emerging causal reasoning abilities, comprehensive benchmarks that assess their causal learning, and their performance when informed by causal features, are still lacking for clinical risk prediction. To address this, we introduce REACT-LLM, a benchmark designed to evaluate whether combining LLMs with causal features can enhance clinical prognostic performance and potentially outperform traditional machine learning (ML) methods. Unlike existing LLM-clinical benchmarks, which often focus on a limited set of outcomes, REACT-LLM evaluates 7 clinical outcomes across 2 real-world datasets, comparing 15 prominent LLMs, 6 traditional ML models, and 3 causal discovery (CD) algorithms. Our findings indicate that while LLMs perform reasonably well in clinical prognostics, they have not yet outperformed traditional ML models. Integrating causal features derived from CD algorithms into LLMs offers limited performance gains, primarily because many CD methods rest on strict assumptions that are often violated in complex clinical data. Although direct integration yields limited improvement, our benchmark reveals a more promising synergy.
