Physicians considering clinical trials for their patients face the laborious process of checking many text-based eligibility criteria. Large Language Models (LLMs) have been shown to perform well at clinical information extraction and clinical reasoning, including on medical tests, but not yet in real-world scenarios. This paper investigates the use of InstructGPT to assist physicians in determining eligibility for clinical trials based on a patient's summarised medical profile. Using a prompting strategy combining one-shot, selection-inference and chain-of-thought techniques, we investigate the performance of LLMs on 10 synthetically created patient profiles. Performance is evaluated at four levels: the ability to identify which of a trial's eligibility criteria are screenable given a medical profile; the ability to classify, for each individual criterion, whether the patient qualifies; the overall classification of whether the patient is eligible for a clinical trial; and the percentage of criteria left to be screened by a physician. We evaluated against 146 clinical trials and a total of 4,135 eligibility criteria. The LLM correctly identified the screenability of 72% (2,994/4,135) of the criteria. Additionally, 72% (341/471) of the screenable criteria were evaluated correctly. The resulting trial-level classification as eligible or ineligible achieved a recall of 0.5. By leveraging LLMs with a physician-in-the-loop, a recall of 1.0 and a precision of 0.71 at the clinical-trial level can be achieved while reducing the number of criteria to be checked by an estimated 90%. LLMs can be used to assist physicians with pre-screening of patients for clinical trials. By forcing instruction-tuned LLMs to produce chain-of-thought responses, the reasoning can be made transparent to physicians and the decision process opened to their review, thereby making such a system feasible for use in real-world scenarios.
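The one-shot plus chain-of-thought strategy described above can be sketched as prompt construction followed by parsing of the model's reasoned answer. This is an illustrative assumption, not the paper's exact prompts: the worked example, field names, and labels below are hypothetical, and the completion would in practice come from an InstructGPT API call.

```python
# Illustrative sketch of one-shot + chain-of-thought prompting for a single
# eligibility criterion. The example profile, wording, and labels are
# hypothetical, not the authors' actual prompts.

ONE_SHOT_EXAMPLE = (
    "Patient profile: 67-year-old male, type 2 diabetes, eGFR 55.\n"
    "Criterion: Age >= 18 years.\n"
    "Reasoning: The profile states the patient is 67, which is at least 18, "
    "so the criterion is met.\n"
    "Answer: ELIGIBLE\n"
)

def build_criterion_prompt(profile: str, criterion: str) -> str:
    """Assemble a one-shot prompt for one criterion; ending with
    'Reasoning:' forces the model to reason before answering."""
    return (
        ONE_SHOT_EXAMPLE
        + "\n"
        + f"Patient profile: {profile}\n"
        + f"Criterion: {criterion}\n"
        + "Reasoning:"
    )

def parse_answer(completion: str) -> str:
    """Pull the final label out of a chain-of-thought completion;
    anything without an 'Answer:' line is treated as not screenable."""
    for line in completion.splitlines():
        if line.startswith("Answer:"):
            return line.removeprefix("Answer:").strip()
    return "UNKNOWN"
```

Keeping the free-text reasoning between the prompt and the final label is what makes the decision transparent: a physician can read the "Reasoning:" span for any criterion before accepting or overriding the label.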