Can we trust the evaluation on ChatGPT? - Zhuanzhi (专知) Papers



March 22, 2023

Can we trust the evaluation on ChatGPT?


Rachith Aiyappa, Jisun An, Haewoon Kwak, Yong-Yeol Ahn

ChatGPT, the first large language model (LLM) with mass adoption, has demonstrated remarkable performance in numerous natural language tasks. Despite its evident usefulness, evaluating ChatGPT's performance in diverse problem domains remains challenging due to the closed nature of the model and its continuous updates via Reinforcement Learning from Human Feedback (RLHF). We highlight the issue of data contamination in ChatGPT evaluations, with a case study of the task of stance detection. We discuss the challenge of preventing data contamination and ensuring fair model evaluation in the age of closed and continuously trained models.
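To make the notion of data contamination concrete, here is a minimal sketch of a word n-gram overlap check between benchmark examples and a training corpus. It is an illustration only, not the authors' method: the function names `word_ngrams` and `contamination_rate` and the default `n = 8` are assumptions introduced here. A check of this kind also presumes access to the training text, which is exactly what a closed model such as ChatGPT does not offer.

```python
from typing import Iterable, Set, Tuple


def word_ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase, whitespace-tokenized word n-grams in `text`."""
    tokens = text.lower().split()
    # range() is empty when the text has fewer than n tokens, giving an empty set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(
    test_examples: Iterable[str],
    corpus_docs: Iterable[str],
    n: int = 8,
) -> float:
    """Fraction of test examples that share at least one n-gram with the corpus.

    Hypothetical helper for illustration; `n` is an assumed threshold.
    """
    corpus_ngrams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_ngrams |= word_ngrams(doc, n)

    examples = list(test_examples)
    if not examples:
        return 0.0

    # An example is flagged if any of its n-grams also occurs in the corpus.
    flagged = sum(1 for ex in examples if word_ngrams(ex, n) & corpus_ngrams)
    return flagged / len(examples)


if __name__ == "__main__":
    corpus = ["the quick brown fox jumps over the lazy dog today"]
    tests = [
        "he said the quick brown fox jumps over the lazy dog",
        "completely unrelated sentence about stance detection on tweets",
    ]
    print(contamination_rate(tests, corpus, n=4))
```

Exact n-gram matching is a deliberately crude choice: it misses paraphrased or reformatted test items, so a low rate does not demonstrate that a benchmark is uncontaminated.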



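Related content

ChatGPT

ChatGPT (full name: Chat Generative Pre-trained Transformer) is a chatbot developed by the US company OpenAI [1] and released on November 30, 2022. It is an AI-driven natural language processing tool that holds conversations by learning and understanding human language, responds in light of the conversation's context much as a person would, and can carry out tasks such as writing emails, video scripts, marketing copy, translations, code, and papers. [1] https://openai.com/blog/chatgpt/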

