As large language models (LLMs) gain popularity among speakers of diverse languages, we believe it is crucial to benchmark them to better understand model behaviors, failures, and limitations in languages beyond English. In this work, we evaluate LLM APIs (ChatGPT, GPT-3, and GPT-4) on the Japanese national medical licensing examinations from the past five years. Our team comprises native Japanese-speaking NLP researchers and a practicing cardiologist based in Japan. Our experiments show that GPT-4 outperforms ChatGPT and GPT-3 and passes all five years of the exams, highlighting LLMs' potential in a language typologically distant from English. However, our evaluation also exposes critical limitations of the current LLM APIs. First, the LLMs sometimes select choices that are strictly prohibited in Japanese medical practice, such as suggesting euthanasia. Second, our analysis shows that API costs are generally higher and the maximum context size is smaller for Japanese because of the way non-Latin scripts are currently tokenized in the pipeline. We release our benchmark, Igaku QA, together with all model outputs and exam metadata. We hope that our results and benchmark will spur progress on more diverse applications of LLMs. Our benchmark is available at https://github.com/jungokasai/IgakuQA.
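The cost and context-size gap comes from how byte-level BPE tokenizers, trained mostly on English text, handle non-Latin scripts: without learned merges for Japanese, characters can fall back to their raw UTF-8 bytes, so token counts track byte length rather than character length. The following is a rough, stdlib-only illustration of this effect using byte counts as a worst-case proxy (the example sentences are our own; exact token counts for a given model would require its actual tokenizer, e.g. via the tiktoken library):

```python
# Worst-case token-count proxy for a byte-level BPE tokenizer with few
# Japanese merges: each token covers at least one UTF-8 byte, and most
# Japanese characters occupy 3 bytes, versus 1 byte per ASCII character.

english = "The patient has chest pain."   # hypothetical example sentence
japanese = "患者は胸痛を訴えている。"      # roughly the same meaning

for label, text in [("English", english), ("Japanese", japanese)]:
    n_chars = len(text)
    n_bytes = len(text.encode("utf-8"))  # upper bound on byte-fallback tokens
    print(f"{label}: {n_chars} chars, {n_bytes} UTF-8 bytes")
```

Here the Japanese sentence is less than half the length in characters (12 vs. 27) yet larger in bytes (36 vs. 27), so under byte-level fallback it can consume more of the context window and cost more per request than the English equivalent.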