Introduction: Traditional Korean medicine (TKM) emphasizes individualized diagnosis and treatment, which makes AI modeling difficult because data are limited and the underlying processes are implicit. GPT-3.5 and GPT-4, two large language models, have shown impressive medical knowledge despite lacking medicine-specific training. This study aimed to assess the capabilities of GPT-3.5 and GPT-4 for TKM using the Korean National Licensing Examination for Korean Medicine Doctors. Methods: GPT-3.5 (February 2023) and GPT-4 (March 2023) models answered 340 questions from the 2022 examination across 12 subjects. Each question was independently evaluated five times in an initialized session. Results: GPT-3.5 and GPT-4 achieved 42.06% and 57.29% accuracy, respectively, with GPT-4 nearing passing performance. Accuracy differed significantly by subject, with 83.75% accuracy for neuropsychiatry compared to 28.75% for internal medicine (2). Both models showed high accuracy on recall-based and diagnosis-based questions but struggled with intervention-based ones. Accuracy was lower for questions that require TKM-specialized knowledge than for questions that do not. GPT-4 showed high accuracy on table-based questions, and both models gave consistent responses. A positive correlation between consistency and accuracy was observed. Conclusion: The models in this study showed near-passing performance in decision-making for TKM without domain-specific training. However, limitations were also observed that were believed to be caused by culturally biased learning. Our study suggests that foundation models have potential in culturally adapted medicine, specifically TKM, for clinical assistance, medical education, and medical research.
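The scoring described in the Methods (340 questions, five independent trials each) can be illustrated with a minimal Python sketch. It assumes that accuracy is pooled over all individual responses and that consistency is the share of a question's trials that agree with its most frequent answer; the abstract does not define either measure, so both definitions are assumptions. The function names and the toy data are hypothetical and are not taken from the study.

```python
from collections import Counter


def pooled_accuracy(responses, answer_key):
    """Fraction of all individual responses that match the answer key."""
    total = sum(len(trials) for trials in responses.values())
    correct = sum(
        answer == answer_key[qid]
        for qid, trials in responses.items()
        for answer in trials
    )
    return correct / total


def consistency(trials):
    """Share of trials agreeing with the most common answer (assumed definition)."""
    most_common_count = Counter(trials).most_common(1)[0][1]
    return most_common_count / len(trials)


# Hypothetical toy data: three questions, five independent trials each.
answer_key = {"q1": 3, "q2": 1, "q3": 5}
responses = {
    "q1": [3, 3, 3, 3, 3],  # always correct, fully consistent
    "q2": [1, 2, 1, 1, 4],  # mostly correct, less consistent
    "q3": [2, 2, 5, 2, 2],  # consistent but mostly wrong
}

print(pooled_accuracy(responses, answer_key))  # 9 of 15 responses are correct
print({qid: consistency(trials) for qid, trials in responses.items()})
```

Under the pooling assumption, 340 questions at five trials each give 1,700 responses per model. The reported accuracies are arithmetically consistent with that total (for example, 715/1700 rounds to 42.06% and 974/1700 rounds to 57.29%), although the abstract does not state the underlying counts.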