评估 GPT-4 和 ChatGPT 在日本医疗执照考试中的表现 (Evaluating GPT-4 and ChatGPT on Japanese Medical Licensing Examinations) - 专知论文

会员服务 ·

0

基准测试 · GPT-4 · 日本 · 基准 · ChatGPT ·

2023 年 4 月 5 日

Evaluating GPT-4 and ChatGPT on Japanese Medical Licensing Examinations

翻译：评估 GPT-4 和 ChatGPT 在日本医疗执照考试中的表现

Jungo Kasai,Yuhei Kasai,Keisuke Sakaguchi,Yutaro Yamada,Dragomir Radev

from arxiv, Added results from the March 2023 exam

As large language models (LLMs) gain popularity among speakers of diverse languages, we believe that it is crucial to benchmark them to better understand model behaviors, failures, and limitations in languages beyond English. In this work, we evaluate LLM APIs (ChatGPT, GPT-3, and GPT-4) on the Japanese national medical licensing examinations from the past five years, including the current year. Our team comprises native Japanese-speaking NLP researchers and a practicing cardiologist based in Japan. Our experiments show that GPT-4 outperforms ChatGPT and GPT-3 and passes all six years of the exams, highlighting LLMs' potential in a language that is typologically distant from English. However, our evaluation also exposes critical limitations of the current LLM APIs. First, LLMs sometimes select prohibited choices that should be strictly avoided in medical practice in Japan, such as suggesting euthanasia. Further, our analysis shows that the API costs are generally higher and the maximum context size is smaller for Japanese because of the way non-Latin scripts are currently tokenized in the pipeline. We release our benchmark as Igaku QA as well as all model outputs and exam metadata. We hope that our results and benchmark will spur progress on more diverse applications of LLMs. Our benchmark is available at https://github.com/jungokasai/IgakuQA.

翻译：随着大型语言模型在使用不同语言的人群中越来越受欢迎，我们相信对它们进行基准测试，以更好地了解模型的行为、失败和在英语以外的语言中的限制是至关重要的。在本文中，我们评估了三个LLM API( ChatGPT、GPT-3 和 GPT-4) 在过去五年日本国家医疗执照考试以及本年度中的表现。我们的研究组由以日语为母语的自然语言处理研究人员以及一名在日本从事心脏病学的医生组成。我们的实验证明，GPT-4在日语考试中表现优异，超过了ChatGPT和GPT-3，并通过了六年的考试，突显了LLM在与英语不同语系的语言中的潜力。然而，我们的评估也暴露了当前LLM API的关键限制。首先，LLM有时会选择在日本医疗实践中应严格避免的禁止的选择，例如建议实行安乐死。此外，我们的分析显示，由于非拉丁字符在当前管道中的标记方式，日语API的成本通常更高，最大上下文大小更小。我们发布了我们的基准测试 Igaku QA，以及所有的模型输出和考试元数据。我们希望我们的结果和基准测试将促进更多多样化的LLM应用。我们的基准测试程序可在 https://github.com/jungokasai/IgakuQA 获取。

0

相关内容

基准测试

基准测试是指通过设计科学的测试方法、测试工具和测试系统，实现对一类测试对象的某项性能指标进行定量的和可对比的测试。

大模型如何用好？亚马逊最新《大型语言模型(LLMs)实践：ChatGPT》综述，全面概述LLM模型、数据、任务的实战指南

大模型如何用好？亚马逊最新《大型语言模型(LLMs)实践：ChatGPT》综述，全面概述LLM模型、数据、任务的实战指南

专知会员服务

139+阅读 · 2023年4月27日

评估ChatGPT的信息提取能力:对性能、可解释性、校准和忠实度的评估

评估ChatGPT的信息提取能力:对性能、可解释性、校准和忠实度的评估

专知会员服务

74+阅读 · 2023年4月26日

ChatGPT和GPT-4的逻辑推理如何？浙大等最新《ChatGPT和GPT-4逻辑推理能力全面评测》论文解答，常规优异新数据差

ChatGPT和GPT-4的逻辑推理如何？浙大等最新《ChatGPT和GPT-4逻辑推理能力全面评测》论文解答，常规优异新数据差

专知会员服务

65+阅读 · 2023年4月19日

194篇文献调研ChatGPT最新研究进展！最新《ChatGPT/GPT-4研究综述及对大型语言模型未来的展望》国内外研究者编著

194篇文献调研ChatGPT最新研究进展！最新《ChatGPT/GPT-4研究综述及对大型语言模型未来的展望》国内外研究者编著

专知会员服务

148+阅读 · 2023年4月7日

百篇论文纵览大型语言模型最新研究进展

百篇论文纵览大型语言模型最新研究进展

专知会员服务

70+阅读 · 2023年3月31日

【2022新书】高效深度学习，Efficient Deep Learning Book

【2022新书】高效深度学习，Efficient Deep Learning Book

专知会员服务

125+阅读 · 2022年4月21日

【USC-Aaron Chan博士答辩Slides】可信自然语言处理机器解释的生成与利用, 242页ppt，Generating and Utilizing Machine Explanations for Trustworthy NLP

【USC-Aaron Chan博士答辩Slides】可信自然语言处理机器解释的生成与利用, 242页ppt，Generating and Utilizing Machine Explanations for Trustworthy NLP

专知会员服务

16+阅读 · 2022年3月13日

【Google】深度学习对抗鲁棒性，43页ppt

专知会员服务

45+阅读 · 2020年10月31日

机器学习入门的经验与建议

机器学习入门的经验与建议

专知会员服务

94+阅读 · 2019年10月10日

【哈佛大学商学院课程Fall 2019】机器学习可解释性

【哈佛大学商学院课程Fall 2019】机器学习可解释性

专知会员服务

105+阅读 · 2019年10月9日

194篇文献调研ChatGPT最新研究进展！最新《ChatGPT/GPT-4研究综述及对大型语言模型未来的展望》国内外研究者编著

194篇文献调研ChatGPT最新研究进展！最新《ChatGPT/GPT-4研究综述及对大型语言模型未来的展望》国内外研究者编著

专知

25+阅读 · 2023年4月7日

VCIP 2022 Call for Demos

VCIP 2022 Call for Demos

CCF多媒体专委会

1+阅读 · 2022年6月6日

RoBERTa中文预训练模型：RoBERTa for Chinese

RoBERTa中文预训练模型：RoBERTa for Chinese

PaperWeekly

57+阅读 · 2019年9月16日

BERT/注意力机制/Transformer/迁移学习NLP资源大列表：awesome-bert-nlp

BERT/注意力机制/Transformer/迁移学习NLP资源大列表：awesome-bert-nlp

AINLP

40+阅读 · 2019年6月9日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

大数据 | 顶级SCI期刊专刊/国际会议信息7条

大数据 | 顶级SCI期刊专刊/国际会议信息7条

Call4Papers

10+阅读 · 2018年12月29日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

【论文推荐】最新六篇推荐系统相关论文—注意力机制、多任务、协同跨网络、非结构化文本、TransRev、章节推荐

【论文推荐】最新六篇推荐系统相关论文—注意力机制、多任务、协同跨网络、非结构化文本、TransRev、章节推荐

专知

12+阅读 · 2018年4月26日

【推荐】RNN最新研究进展综述

【推荐】RNN最新研究进展综述

机器学习研究会

26+阅读 · 2018年1月6日

【推荐】图像分类必读开创性论文汇总

【推荐】图像分类必读开创性论文汇总

机器学习研究会

14+阅读 · 2017年8月15日

基于微流控芯片平台研究CXCL12-CXCR4信号对乳腺癌干细胞转移的影响

国家自然科学基金

0+阅读 · 2015年12月31日

肺循环肿瘤细胞分子表型鉴定

国家自然科学基金

0+阅读 · 2014年12月31日

S3AGA样本（Spitzer-SDSS Spectral Atlas of Galaxies and AGNs)及其AGN研究

国家自然科学基金

0+阅读 · 2014年12月31日

铜铁矿结构CuXO2(X=Y,Sc,Al)红外透明导电薄膜的光电性能和择优取向制备研究

国家自然科学基金

0+阅读 · 2013年12月31日

语音信号声纹信息成分的深层表达

国家自然科学基金

0+阅读 · 2012年12月31日

CyberKnife剂量校准与验证的系统研究

国家自然科学基金

0+阅读 · 2012年12月31日

实时安全关键系统的建模、仿真与验证

国家自然科学基金

1+阅读 · 2012年12月31日

VI和VIA族金属元素掺杂对TiAl基合金力学性能影响机制的理论研究

国家自然科学基金

0+阅读 · 2011年12月31日

sRAGE对缺血/再灌注的心脏保护作用及其机制

国家自然科学基金

0+阅读 · 2008年12月31日

c-Myc及Cyclin A2诱导豚鼠耳蜗前体细胞增殖的实验研究

国家自然科学基金

0+阅读 · 2008年12月31日

Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models

Arxiv

0+阅读 · 2023年5月24日

Carving UI Tests to Generate API Tests and API Specification

Arxiv

0+阅读 · 2023年5月24日

Evaluating ChatGPT's Performance for Multilingual and Emoji-based Hate Speech Detection

Arxiv

0+阅读 · 2023年5月22日

SCITAB: A Challenging Benchmark for Compositional Reasoning and Claim Verification on Scientific Tables

Arxiv

0+阅读 · 2023年5月22日

Evaluating and Enhancing Structural Understanding Capabilities of Large Language Models on Tables via Input Designs

Arxiv

0+阅读 · 2023年5月22日

RWKV: Reinventing RNNs for the Transformer Era

Arxiv

3+阅读 · 2023年5月22日

Can Large Language Models emulate an inductive Thematic Analysis of semi-structured interviews? An exploration and provocation on the limits of the approach and the model

Arxiv

0+阅读 · 2023年5月22日

Is GPT-3 all you need for Visual Question Answering in Cultural Heritage?

Arxiv

0+阅读 · 2023年5月19日

ChatGPT is a Knowledgeable but Inexperienced Solver: An Investigation of Commonsense Problem in Large Language Models

Arxiv

61+阅读 · 2023年3月29日

Pretrained Transformers for Text Ranking: BERT and Beyond

Arxiv

28+阅读 · 2020年10月13日

VIP会员

文章信息

相关主题

相关VIP内容

大模型如何用好？亚马逊最新《大型语言模型(LLMs)实践：ChatGPT》综述，全面概述LLM模型、数据、任务的实战指南

大模型如何用好？亚马逊最新《大型语言模型(LLMs)实践：ChatGPT》综述，全面概述LLM模型、数据、任务的实战指南

专知会员服务

139+阅读 · 2023年4月27日

评估ChatGPT的信息提取能力:对性能、可解释性、校准和忠实度的评估

评估ChatGPT的信息提取能力:对性能、可解释性、校准和忠实度的评估

专知会员服务

74+阅读 · 2023年4月26日

ChatGPT和GPT-4的逻辑推理如何？浙大等最新《ChatGPT和GPT-4逻辑推理能力全面评测》论文解答，常规优异新数据差

ChatGPT和GPT-4的逻辑推理如何？浙大等最新《ChatGPT和GPT-4逻辑推理能力全面评测》论文解答，常规优异新数据差

专知会员服务

65+阅读 · 2023年4月19日

194篇文献调研ChatGPT最新研究进展！最新《ChatGPT/GPT-4研究综述及对大型语言模型未来的展望》国内外研究者编著

194篇文献调研ChatGPT最新研究进展！最新《ChatGPT/GPT-4研究综述及对大型语言模型未来的展望》国内外研究者编著

专知会员服务

148+阅读 · 2023年4月7日

百篇论文纵览大型语言模型最新研究进展

百篇论文纵览大型语言模型最新研究进展

专知会员服务

70+阅读 · 2023年3月31日

【2022新书】高效深度学习，Efficient Deep Learning Book

【2022新书】高效深度学习，Efficient Deep Learning Book

专知会员服务

125+阅读 · 2022年4月21日

【USC-Aaron Chan博士答辩Slides】可信自然语言处理机器解释的生成与利用, 242页ppt，Generating and Utilizing Machine Explanations for Trustworthy NLP

【USC-Aaron Chan博士答辩Slides】可信自然语言处理机器解释的生成与利用, 242页ppt，Generating and Utilizing Machine Explanations for Trustworthy NLP

专知会员服务

16+阅读 · 2022年3月13日

【Google】深度学习对抗鲁棒性，43页ppt

专知会员服务

45+阅读 · 2020年10月31日

机器学习入门的经验与建议

机器学习入门的经验与建议

专知会员服务

94+阅读 · 2019年10月10日

【哈佛大学商学院课程Fall 2019】机器学习可解释性

【哈佛大学商学院课程Fall 2019】机器学习可解释性

专知会员服务

105+阅读 · 2019年10月9日

热门VIP内容

开通专知VIP会员享更多权益服务

《复杂工程系统模型驱动设计决策支持系统：早期设计阶段挑战》最新138页

《日本陆上自卫队2040年作战方式与未来作战研究》最新23页slides

人工智能作为战争武器

《后勤保障》最新23页

相关资讯

194篇文献调研ChatGPT最新研究进展！最新《ChatGPT/GPT-4研究综述及对大型语言模型未来的展望》国内外研究者编著

194篇文献调研ChatGPT最新研究进展！最新《ChatGPT/GPT-4研究综述及对大型语言模型未来的展望》国内外研究者编著

专知

25+阅读 · 2023年4月7日

VCIP 2022 Call for Demos

VCIP 2022 Call for Demos

CCF多媒体专委会

1+阅读 · 2022年6月6日

RoBERTa中文预训练模型：RoBERTa for Chinese

RoBERTa中文预训练模型：RoBERTa for Chinese

PaperWeekly

57+阅读 · 2019年9月16日

BERT/注意力机制/Transformer/迁移学习NLP资源大列表：awesome-bert-nlp

BERT/注意力机制/Transformer/迁移学习NLP资源大列表：awesome-bert-nlp

AINLP

40+阅读 · 2019年6月9日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

大数据 | 顶级SCI期刊专刊/国际会议信息7条

大数据 | 顶级SCI期刊专刊/国际会议信息7条

Call4Papers

10+阅读 · 2018年12月29日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

【论文推荐】最新六篇推荐系统相关论文—注意力机制、多任务、协同跨网络、非结构化文本、TransRev、章节推荐

【论文推荐】最新六篇推荐系统相关论文—注意力机制、多任务、协同跨网络、非结构化文本、TransRev、章节推荐

专知

12+阅读 · 2018年4月26日

【推荐】RNN最新研究进展综述

【推荐】RNN最新研究进展综述

机器学习研究会

26+阅读 · 2018年1月6日

【推荐】图像分类必读开创性论文汇总

【推荐】图像分类必读开创性论文汇总

机器学习研究会

14+阅读 · 2017年8月15日

相关论文

Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models

Arxiv

0+阅读 · 2023年5月24日

Carving UI Tests to Generate API Tests and API Specification

Arxiv

0+阅读 · 2023年5月24日

Evaluating ChatGPT's Performance for Multilingual and Emoji-based Hate Speech Detection

Arxiv

0+阅读 · 2023年5月22日

SCITAB: A Challenging Benchmark for Compositional Reasoning and Claim Verification on Scientific Tables

Arxiv

0+阅读 · 2023年5月22日

Evaluating and Enhancing Structural Understanding Capabilities of Large Language Models on Tables via Input Designs

Arxiv

0+阅读 · 2023年5月22日

RWKV: Reinventing RNNs for the Transformer Era

Arxiv

3+阅读 · 2023年5月22日

Can Large Language Models emulate an inductive Thematic Analysis of semi-structured interviews? An exploration and provocation on the limits of the approach and the model

Arxiv

0+阅读 · 2023年5月22日

Is GPT-3 all you need for Visual Question Answering in Cultural Heritage?

Arxiv

0+阅读 · 2023年5月19日

ChatGPT is a Knowledgeable but Inexperienced Solver: An Investigation of Commonsense Problem in Large Language Models

Arxiv

61+阅读 · 2023年3月29日

Pretrained Transformers for Text Ranking: BERT and Beyond

Arxiv

28+阅读 · 2020年10月13日

相关基金

基于微流控芯片平台研究CXCL12-CXCR4信号对乳腺癌干细胞转移的影响

国家自然科学基金

0+阅读 · 2015年12月31日

肺循环肿瘤细胞分子表型鉴定

国家自然科学基金

0+阅读 · 2014年12月31日

S3AGA样本（Spitzer-SDSS Spectral Atlas of Galaxies and AGNs)及其AGN研究

国家自然科学基金

0+阅读 · 2014年12月31日

铜铁矿结构CuXO2(X=Y,Sc,Al)红外透明导电薄膜的光电性能和择优取向制备研究

国家自然科学基金

0+阅读 · 2013年12月31日

语音信号声纹信息成分的深层表达

国家自然科学基金

0+阅读 · 2012年12月31日

CyberKnife剂量校准与验证的系统研究

国家自然科学基金

0+阅读 · 2012年12月31日

实时安全关键系统的建模、仿真与验证

国家自然科学基金

1+阅读 · 2012年12月31日

VI和VIA族金属元素掺杂对TiAl基合金力学性能影响机制的理论研究

国家自然科学基金

0+阅读 · 2011年12月31日

sRAGE对缺血/再灌注的心脏保护作用及其机制

国家自然科学基金

0+阅读 · 2008年12月31日

c-Myc及Cyclin A2诱导豚鼠耳蜗前体细胞增殖的实验研究

国家自然科学基金

0+阅读 · 2008年12月31日

微信扫码咨询专知VIP会员