Recent advances in large language models (LLMs) have shown impressive performance in biomedical question answering, but these models have not been adequately investigated for more specific biomedical applications. This study evaluates the performance of LLMs such as the ChatGPT family of models (GPT-3.5s, GPT-4) on biomedical tasks beyond question answering. Because no patient data can be passed to the public OpenAI API, we evaluated model performance on more than 10,000 samples as proxies for two fundamental tasks in the clinical domain: classification and reasoning. The first task is classifying whether statements of clinical and policy recommendations in the scientific literature constitute health advice. The second task is detecting causal relations in the biomedical literature. We compared the LLMs with simpler models, such as bag-of-words (BoW) with logistic regression, and with fine-tuned BioBERT models. Despite the excitement around the viral popularity of ChatGPT, we found that fine-tuning remained the best strategy for these two fundamental NLP tasks. The simple BoW model performed on par with the most complex LLM prompting, and prompt engineering required significant investment.
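As a rough illustration of the simpler baseline mentioned above, the following sketch shows a bag-of-words model with logistic regression applied to a binary health-advice classification task using scikit-learn. It is not the authors' exact pipeline: the file name "health_advice.csv" and the column names "sentence" and "is_advice" are hypothetical placeholders.

```python
# Minimal sketch of a BoW + logistic regression baseline for health-advice
# classification. The dataset path and column names are hypothetical.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Hypothetical data: one sentence per row, with a binary advice label.
df = pd.read_csv("health_advice.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["sentence"], df["is_advice"], test_size=0.2, random_state=42
)

# Bag-of-words features: unigram counts over the training vocabulary.
vectorizer = CountVectorizer(lowercase=True, ngram_range=(1, 1))
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

# Logistic regression classifier on the sparse count features.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_bow, y_train)

print(classification_report(y_test, clf.predict(X_test_bow)))
```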