As large language models (LLMs) gain popularity among speakers of diverse languages, we believe that it is crucial to benchmark them to better understand model behaviors, failures, and limitations in languages beyond English. In this work, we evaluate LLM APIs (ChatGPT, GPT-3, and GPT-4) on the Japanese national medical licensing examinations from the past five years plus the current year (six exams in total). Our team comprises native Japanese-speaking NLP researchers and a practicing cardiologist based in Japan. Our experiments show that GPT-4 outperforms ChatGPT and GPT-3 and passes all six years of the exams, highlighting LLMs' potential in a language that is typologically distant from English. However, our evaluation also exposes critical limitations of the current LLM APIs. First, LLMs sometimes select prohibited choices that should be strictly avoided in medical practice in Japan, such as suggesting euthanasia. Second, our analysis shows that, for Japanese, API costs are generally higher and the effective maximum context size is smaller because of the way non-Latin scripts are currently tokenized in the pipeline. We release our benchmark as Igaku QA, together with all model outputs and exam metadata. We hope that our results and benchmark will spur progress on more diverse applications of LLMs. Our benchmark is available at https://github.com/jungokasai/IgakuQA.
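The tokenization point can be illustrated with a rough proxy. The sketch below assumes a byte-level BPE tokenizer, as used by GPT-style models, in which every token covers at least one UTF-8 byte, so the byte count of a text is an upper bound on its token count. It does not measure actual API cost or call any real tokenizer. The helper name `utf8_bytes_per_char` and both example sentences are hypothetical and are not taken from the benchmark.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Return the average number of UTF-8 bytes per character in `text`."""
    if not text:
        raise ValueError("text must be non-empty")
    return len(text.encode("utf-8")) / len(text)


# Hypothetical example sentences (assumptions, not benchmark data).
english = "The patient has a fever."
japanese = "患者は発熱している。"

en_ratio = utf8_bytes_per_char(english)   # ASCII: 1 byte per character
ja_ratio = utf8_bytes_per_char(japanese)  # kanji, kana, and "。": 3 bytes each

print(f"English : {en_ratio:.1f} bytes/char")   # prints 1.0
print(f"Japanese: {ja_ratio:.1f} bytes/char")   # prints 3.0
```

Since merges learned mostly from English text rarely combine these multi-byte sequences into single tokens, a Japanese passage tends to consume more tokens than an English passage of similar content, which is consistent with the higher cost and smaller effective context reported above.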