Access to the right evidence does not guarantee that large language models (LLMs) will reason with it correctly. This gap between retrieval and reasoning is especially concerning in clinical settings, where outputs must align with structured protocols. We study this gap using Written Exposure Therapy (WET) guidelines as a testbed. Evaluating model responses to a curated set of clinician-vetted questions, we find that errors persist even when authoritative passages are provided. To address this, we propose an evaluation framework that measures the accuracy, consistency, and fidelity of model reasoning. Our results highlight both the potential and the risks: retrieval-augmented generation (RAG) can constrain outputs, but safe deployment requires assessing reasoning as rigorously as retrieval.