Can Large Language Models Be an Alternative to Human Evaluations? - 专知论文

会员服务 ·

0

语言模型化 · MoDELS · 道德考虑 · Performer · 样本 ·

2023 年 5 月 3 日

Can Large Language Models Be an Alternative to Human Evaluations?

翻译：暂无翻译

Cheng-Han Chiang,Hung-yi Lee

from arxiv, ACL 2023 main conference paper. Main content: 10 pages (including limitations). Appendix: 13 pages

Human evaluation is indispensable and inevitable for assessing the quality of texts generated by machine learning models or written by humans. However, human evaluation is very difficult to reproduce and its quality is notoriously unstable, hindering fair comparisons among different natural language processing (NLP) models and algorithms. Recently, large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided. In this paper, we explore if such an ability of the LLMs can be used as an alternative to human evaluation. We present the LLMs with the exact same instructions, samples to be evaluated, and questions used to conduct human evaluation, and then ask the LLMs to generate responses to those questions; we dub this LLM evaluation. We use human evaluation and LLM evaluation to evaluate the texts in two NLP tasks: open-ended story generation and adversarial attacks. We show that the result of LLM evaluation is consistent with the results obtained by expert human evaluation: the texts rated higher by human experts are also rated higher by the LLMs. We also find that the results of LLM evaluation are stable over different formatting of the task instructions and the sampling algorithm used to generate the answer. We are the first to show the potential of using LLMs to assess the quality of texts and discuss the limitations and ethical considerations of LLM evaluation.

翻译：暂无翻译

0

相关内容

语言模型化

语言模型化

NLP必读经典文献100篇

专知会员服务

124+阅读 · 2020年9月8日

Aspect-Oriented Syntax Network for Aspect-Based Sentiment Analysis，中山大学数据科学与计算机学院权小军教授，第八届全国社会媒体处理大会SMP2019

Aspect-Oriented Syntax Network for Aspect-Based Sentiment Analysis，中山大学数据科学与计算机学院权小军教授，第八届全国社会媒体处理大会SMP2019

专知会员服务

19+阅读 · 2019年10月22日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

专知会员服务

163+阅读 · 2019年10月12日

机器学习入门的经验与建议

机器学习入门的经验与建议

专知会员服务

94+阅读 · 2019年10月10日

【人工智能在2019：一年回顾】反人工智能，AI in 2019: A Year in Review

【人工智能在2019：一年回顾】反人工智能，AI in 2019: A Year in Review

专知会员服务

79+阅读 · 2019年10月10日

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

专知会员服务

65+阅读 · 2019年10月9日

【哈佛大学商学院课程Fall 2019】机器学习可解释性

【哈佛大学商学院课程Fall 2019】机器学习可解释性

专知会员服务

105+阅读 · 2019年10月9日

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

专知会员服务

41+阅读 · 2019年10月9日

直播 | Interpretable and Trustworthy Graph Geometric Deep Learning

直播 | Interpretable and Trustworthy Graph Geometric Deep Learning

图与推荐

2+阅读 · 2022年11月2日

RoBERTa中文预训练模型：RoBERTa for Chinese

RoBERTa中文预训练模型：RoBERTa for Chinese

PaperWeekly

57+阅读 · 2019年9月16日

RoBERTa for Chinese：大规模中文预训练RoBERTa模型

RoBERTa for Chinese：大规模中文预训练RoBERTa模型

AINLP

30+阅读 · 2019年9月8日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

【论文推荐】最新7篇聊天机器人（Chatbot）相关论文—触动你的心、DeepProbe、饮食推荐、知识学习、交互、挑战、管理

【论文推荐】最新7篇聊天机器人（Chatbot）相关论文—触动你的心、DeepProbe、饮食推荐、知识学习、交互、挑战、管理

专知

12+阅读 · 2018年3月15日

【推荐】自然语言处理（NLP）指南

【推荐】自然语言处理（NLP）指南

机器学习研究会

35+阅读 · 2017年11月17日

拟南芥FT调控开花的分子机制研究

国家自然科学基金

0+阅读 · 2015年12月31日

化疗诱导的细胞衰老在神经母细胞瘤复发中的作用及分子机制

国家自然科学基金

0+阅读 · 2014年12月31日

二维MoS2纳米片电子结构的高压调控及机理研究

国家自然科学基金

0+阅读 · 2013年12月31日

Kronheimer-Nakajima quiver 模空间与有理曲面

国家自然科学基金

1+阅读 · 2013年12月31日

冷胁迫诱导柽柳ThCAP基因表达的分子机制研究

国家自然科学基金

0+阅读 · 2012年12月31日

RI与Angiogenin相互作用调控PI3K/AKT/mTOR信号通路和ANG的核转位在膀胱癌发生发展中的机制研究

国家自然科学基金

0+阅读 · 2012年12月31日

联合携PGE2基因骨髓间充质干细胞移植重塑Kuppfer细胞表型诱导大鼠肝移植术后免疫耐受

国家自然科学基金

0+阅读 · 2011年12月31日

基于藤Copula-GARCH与时变Levy过程的多重货币期权定价的蒙特卡罗模拟方法研究

国家自然科学基金

1+阅读 · 2011年12月31日

多自由度哈密顿系统的动力学不稳定性研究

国家自然科学基金

0+阅读 · 2011年12月31日

铈基重费米子材料及钌系化合物中的磁性量子相变与非费米液体行为

国家自然科学基金

0+阅读 · 2008年12月31日

ClinicalGPT: Large Language Models Finetuned with Diverse Medical Data and Comprehensive Evaluation

ClinicalGPT: Large Language Models Finetuned with Diverse Medical Data and Comprehensive Evaluation

Arxiv

0+阅读 · 2023年6月16日

Evaluating GPT-3 Generated Explanations for Hateful Content Moderation

Arxiv

0+阅读 · 2023年6月16日

Clickbait Detection via Large Language Models

Arxiv

0+阅读 · 2023年6月16日

Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

Arxiv

0+阅读 · 2023年6月15日

LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

Arxiv

0+阅读 · 2023年6月15日

User Simulation for Evaluating Information Access Systems

Arxiv

0+阅读 · 2023年6月14日

Perceptions and Realities of Text-to-Image Generation

Arxiv

0+阅读 · 2023年6月14日

Large Language Models are Few-Shot Summarizers: Multi-Intent Comment Generation via In-Context Learning

Arxiv

0+阅读 · 2023年6月14日

Learning When to Ask for Help: Transferring Human Knowledge through Part-Time Demonstration

Arxiv

0+阅读 · 2023年6月14日

Beyond One-Model-Fits-All: A Survey of Domain Specialization for Large Language Models

Arxiv

66+阅读 · 2023年5月31日

VIP会员

文章信息

相关主题

语言模型化

相关VIP内容

NLP必读经典文献100篇

专知会员服务

124+阅读 · 2020年9月8日

Aspect-Oriented Syntax Network for Aspect-Based Sentiment Analysis，中山大学数据科学与计算机学院权小军教授，第八届全国社会媒体处理大会SMP2019

Aspect-Oriented Syntax Network for Aspect-Based Sentiment Analysis，中山大学数据科学与计算机学院权小军教授，第八届全国社会媒体处理大会SMP2019

专知会员服务

19+阅读 · 2019年10月22日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

专知会员服务

163+阅读 · 2019年10月12日

机器学习入门的经验与建议

机器学习入门的经验与建议

专知会员服务

94+阅读 · 2019年10月10日

【人工智能在2019：一年回顾】反人工智能，AI in 2019: A Year in Review

【人工智能在2019：一年回顾】反人工智能，AI in 2019: A Year in Review

专知会员服务

79+阅读 · 2019年10月10日

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

【加州大学伯克利分校博士论文】通过自我监督预测学习泛化

专知会员服务

65+阅读 · 2019年10月9日

【哈佛大学商学院课程Fall 2019】机器学习可解释性

【哈佛大学商学院课程Fall 2019】机器学习可解释性

专知会员服务

105+阅读 · 2019年10月9日

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

【SIGGRAPH2019】TensorFlow 2.0深度学习计算机图形学应用

专知会员服务

41+阅读 · 2019年10月9日

热门VIP内容

开通专知VIP会员享更多权益服务

《俄乌战争中的无人系统：新的战争方式与新兴趋势——来自前线的印象》报告

《海上自主水面船舶远程操作中心：安全可持续运行的多维度分析》

多模态大语言模型下游调优中“保持自我”的重要性

隐身自主无人水下航行器技术如何变革水下作战并重塑海军竞争

相关资讯

直播 | Interpretable and Trustworthy Graph Geometric Deep Learning

直播 | Interpretable and Trustworthy Graph Geometric Deep Learning

图与推荐

2+阅读 · 2022年11月2日

RoBERTa中文预训练模型：RoBERTa for Chinese

RoBERTa中文预训练模型：RoBERTa for Chinese

PaperWeekly

57+阅读 · 2019年9月16日

RoBERTa for Chinese：大规模中文预训练RoBERTa模型

RoBERTa for Chinese：大规模中文预训练RoBERTa模型

AINLP

30+阅读 · 2019年9月8日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

disentangled-representation-papers

disentangled-representation-papers

CreateAMind

26+阅读 · 2018年9月12日

【论文推荐】最新7篇聊天机器人（Chatbot）相关论文—触动你的心、DeepProbe、饮食推荐、知识学习、交互、挑战、管理

【论文推荐】最新7篇聊天机器人（Chatbot）相关论文—触动你的心、DeepProbe、饮食推荐、知识学习、交互、挑战、管理

专知

12+阅读 · 2018年3月15日

【推荐】自然语言处理（NLP）指南

【推荐】自然语言处理（NLP）指南

机器学习研究会

35+阅读 · 2017年11月17日

相关论文

ClinicalGPT: Large Language Models Finetuned with Diverse Medical Data and Comprehensive Evaluation

ClinicalGPT: Large Language Models Finetuned with Diverse Medical Data and Comprehensive Evaluation

Arxiv

0+阅读 · 2023年6月16日

Evaluating GPT-3 Generated Explanations for Hateful Content Moderation

Arxiv

0+阅读 · 2023年6月16日

Clickbait Detection via Large Language Models

Arxiv

0+阅读 · 2023年6月16日

Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

Arxiv

0+阅读 · 2023年6月15日

LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

Arxiv

0+阅读 · 2023年6月15日

User Simulation for Evaluating Information Access Systems

Arxiv

0+阅读 · 2023年6月14日

Perceptions and Realities of Text-to-Image Generation

Arxiv

0+阅读 · 2023年6月14日

Large Language Models are Few-Shot Summarizers: Multi-Intent Comment Generation via In-Context Learning

Arxiv

0+阅读 · 2023年6月14日

Learning When to Ask for Help: Transferring Human Knowledge through Part-Time Demonstration

Arxiv

0+阅读 · 2023年6月14日

Beyond One-Model-Fits-All: A Survey of Domain Specialization for Large Language Models

Arxiv

66+阅读 · 2023年5月31日

相关基金

拟南芥FT调控开花的分子机制研究

国家自然科学基金

0+阅读 · 2015年12月31日

化疗诱导的细胞衰老在神经母细胞瘤复发中的作用及分子机制

国家自然科学基金

0+阅读 · 2014年12月31日

二维MoS2纳米片电子结构的高压调控及机理研究

国家自然科学基金

0+阅读 · 2013年12月31日

Kronheimer-Nakajima quiver 模空间与有理曲面

国家自然科学基金

1+阅读 · 2013年12月31日

冷胁迫诱导柽柳ThCAP基因表达的分子机制研究

国家自然科学基金

0+阅读 · 2012年12月31日

RI与Angiogenin相互作用调控PI3K/AKT/mTOR信号通路和ANG的核转位在膀胱癌发生发展中的机制研究

国家自然科学基金

0+阅读 · 2012年12月31日

联合携PGE2基因骨髓间充质干细胞移植重塑Kuppfer细胞表型诱导大鼠肝移植术后免疫耐受

国家自然科学基金

0+阅读 · 2011年12月31日

基于藤Copula-GARCH与时变Levy过程的多重货币期权定价的蒙特卡罗模拟方法研究

国家自然科学基金

1+阅读 · 2011年12月31日

多自由度哈密顿系统的动力学不稳定性研究

国家自然科学基金

0+阅读 · 2011年12月31日

铈基重费米子材料及钌系化合物中的磁性量子相变与非费米液体行为

国家自然科学基金

0+阅读 · 2008年12月31日

微信扫码咨询专知VIP会员