Large language models (LLMs) have emerged as valuable tools for many natural language understanding tasks. In safety-critical applications such as healthcare, the utility of these models is governed by their ability to generate outputs that are factually accurate and complete. In this work, we present dialog-enabled resolving agents (DERA). DERA is a paradigm made possible by the increased conversational abilities of LLMs, namely GPT-4. It provides a simple, interpretable forum for models to communicate feedback and iteratively improve output. We frame our dialog as a discussion between two agent types - a Researcher, who processes information and identifies crucial problem components, and a Decider, who has the autonomy to integrate the Researcher's information and make judgments on the final output. We test DERA against three clinically-focused tasks. For medical conversation summarization and care plan generation, DERA shows significant improvement over the base GPT-4 performance in both human expert preference evaluations and quantitative metrics. In a new finding, we also show that GPT-4's performance (70%) on an open-ended version of the MedQA question-answering (QA) dataset (Jin et al. 2021, USMLE) is well above the passing level (60%), with DERA showing similar performance. We release the open-ended MedQA dataset at https://github.com/curai/curai-research/tree/main/DERA.