HELMA: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models - 专知论文

会员服务 ·

0

语言模型化 · Performer · MoDELS · ChatGPT · 知识 (knowledge) ·

2023 年 5 月 19 日

HELMA: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models

翻译：暂无翻译

Junyi Li,Xiaoxue Cheng,Wayne Xin Zhao,Jian-Yun Nie,Ji-Rong Wen

from arxiv, Working in progress

Large language models (LLMs), such as ChatGPT, are prone to generate hallucinations, \ie content that conflicts with the source or cannot be verified by the factual knowledge. To understand what types of content and to which extent LLMs are apt to hallucinate, we introduce the Hallucination Evaluation for Large Language Models (HELMA) benchmark, a large collection of generated and human-annotated hallucinated samples for evaluating the performance of LLMs in recognizing and alleviating hallucination. To generate these samples, we propose a ChatGPT-based two-step framework, \ie sampling-then-filtering. Specifically, we first adopt two different sampling methods to generate hallucinated samples based on instructions, and then use an example-enhanced filtering method to select the best one. Furthermore, we also hire some human labelers to annotate the hallucinations in ChatGPT responses. The empirical results suggest that ChatGPT has some probabilities to generate hallucinations and existing LLMs face great challenges in recognizing the hallucinations in text. In addition, the performance can be improved by providing external knowledge or adding reasoning steps. Our benchmark can be accessed at https://github.com/RUCAIBox/HELMA.

翻译：暂无翻译

0

相关内容

语言模型化

语言模型化

百篇论文纵览大型语言模型最新研究进展

百篇论文纵览大型语言模型最新研究进展

专知会员服务

70+阅读 · 2023年3月31日

高效可扩展图神经网络的研究进展，Recent Advances in Efficient and Scalable Graph Neural Networks

高效可扩展图神经网络的研究进展，Recent Advances in Efficient and Scalable Graph Neural Networks

专知会员服务

78+阅读 · 2022年3月15日

50+篇《神经架构搜索NAS》2020论文合集

专知会员服务

61+阅读 · 2020年3月19日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

专知会员服务

163+阅读 · 2019年10月12日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

[综述]深度学习下的场景文本检测与识别

[综述]深度学习下的场景文本检测与识别

专知会员服务

78+阅读 · 2019年10月10日

【人工智能在2019：一年回顾】反人工智能，AI in 2019: A Year in Review

【人工智能在2019：一年回顾】反人工智能，AI in 2019: A Year in Review

专知会员服务

79+阅读 · 2019年10月10日

【CMU卡内基梅隆大学】深度学习在计算机视觉的应用：方法，解释，因果与公平性

【CMU卡内基梅隆大学】深度学习在计算机视觉的应用：方法，解释，因果与公平性

专知会员服务

83+阅读 · 2019年10月9日

【哈佛大学商学院课程Fall 2019】机器学习可解释性

【哈佛大学商学院课程Fall 2019】机器学习可解释性

专知会员服务

105+阅读 · 2019年10月9日

GNN 新基准！Long Range Graph Benchmark

GNN 新基准！Long Range Graph Benchmark

图与推荐

0+阅读 · 2022年10月18日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

【代码资源】GAN | 七份最热GAN文章及代码分享（Github 1000+Stars）

【代码资源】GAN | 七份最热GAN文章及代码分享（Github 1000+Stars）

专知

13+阅读 · 2018年6月24日

【论文推荐】最新八篇生成对抗网络相关论文—BRE、图像合成、多模态图像生成、非配对多域图、注意力、对抗特征增强、深度对抗性训练

【论文推荐】最新八篇生成对抗网络相关论文—BRE、图像合成、多模态图像生成、非配对多域图、注意力、对抗特征增强、深度对抗性训练

专知

16+阅读 · 2018年5月14日

【论文推荐】最新5篇图像描述生成（Image Caption）相关论文—情感、注意力机制、遥感图像、序列到序列、深度神经结构

【论文推荐】最新5篇图像描述生成（Image Caption）相关论文—情感、注意力机制、遥感图像、序列到序列、深度神经结构

专知

66+阅读 · 2018年1月31日

【论文】变分推断（Variational inference)的总结

【论文】变分推断（Variational inference)的总结

机器学习研究会

39+阅读 · 2017年11月16日

p53基因突变促进Wilms 肿瘤发展转移的小鼠动物模型研究

国家自然科学基金

0+阅读 · 2014年12月31日

Delta-Sarcoglycan基因的两个新突变在东亚人遗传性心肌病中的致病作用及其机理

国家自然科学基金

0+阅读 · 2014年12月31日

常染色体隐性遗传小脑性共济失调新的致病基因CAX的功能研究

国家自然科学基金

0+阅读 · 2014年12月31日

近相变温度下钛表面融盐电解法渗硼的渗层强化生长机制

国家自然科学基金

0+阅读 · 2013年12月31日

p75NTR基因rs2072446多态性所致错义突变在阿尔茨海默病发生中的作用及机制研究

国家自然科学基金

0+阅读 · 2013年12月31日

大行程、高分辨率直线型超声电机及其驱动的精密运动平台的研究

国家自然科学基金

0+阅读 · 2012年12月31日

胆固醇酯转运蛋白在息肉状脉络膜血管病变发病中的作用机制研究

国家自然科学基金

0+阅读 · 2012年12月31日

BRR2蛋白突变导致视网膜色素变性发病机制的研究

国家自然科学基金

0+阅读 · 2011年12月31日

维生素D及其受体基因多态性与高血压发病的研究

国家自然科学基金

0+阅读 · 2009年12月31日

sRAGE对缺血/再灌注的心脏保护作用及其机制

国家自然科学基金

0+阅读 · 2008年12月31日

Evaluating Open-Domain Question Answering in the Era of Large Language Models

Arxiv

0+阅读 · 2023年7月6日

KoLA: Carefully Benchmarking World Knowledge of Large Language Models

Arxiv

0+阅读 · 2023年7月6日

A Survey on Evaluation of Large Language Models

Arxiv

0+阅读 · 2023年7月6日

Style Over Substance: Evaluation Biases for Large Language Models

Arxiv

0+阅读 · 2023年7月6日

A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets

Arxiv

0+阅读 · 2023年7月5日

Embodied Task Planning with Large Language Models

Arxiv

1+阅读 · 2023年7月4日

CARE-MI: Chinese Benchmark for Misinformation Evaluation in Maternity and Infant Care

Arxiv

0+阅读 · 2023年7月4日

A Critical Re-evaluation of Benchmark Datasets for (Deep) Learning-Based Matching Algorithms

Arxiv

0+阅读 · 2023年7月3日

Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey

Arxiv

25+阅读 · 2023年2月20日

DOTA: A Large-scale Dataset for Object Detection in Aerial Images

Arxiv

19+阅读 · 2018年1月27日

VIP会员

文章信息

相关主题

语言模型化

知识 (knowledge)

相关VIP内容

百篇论文纵览大型语言模型最新研究进展

百篇论文纵览大型语言模型最新研究进展

专知会员服务

70+阅读 · 2023年3月31日

高效可扩展图神经网络的研究进展，Recent Advances in Efficient and Scalable Graph Neural Networks

高效可扩展图神经网络的研究进展，Recent Advances in Efficient and Scalable Graph Neural Networks

专知会员服务

78+阅读 · 2022年3月15日

50+篇《神经架构搜索NAS》2020论文合集

专知会员服务

61+阅读 · 2020年3月19日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

Keras François Chollet 《Deep Learning with Python 》, 386页pdf

专知会员服务

163+阅读 · 2019年10月12日

强化学习最新教程，17页pdf

强化学习最新教程，17页pdf

专知会员服务

182+阅读 · 2019年10月11日

[综述]深度学习下的场景文本检测与识别

[综述]深度学习下的场景文本检测与识别

专知会员服务

78+阅读 · 2019年10月10日

【人工智能在2019：一年回顾】反人工智能，AI in 2019: A Year in Review

【人工智能在2019：一年回顾】反人工智能，AI in 2019: A Year in Review

专知会员服务

79+阅读 · 2019年10月10日

【CMU卡内基梅隆大学】深度学习在计算机视觉的应用：方法，解释，因果与公平性

【CMU卡内基梅隆大学】深度学习在计算机视觉的应用：方法，解释，因果与公平性

专知会员服务

83+阅读 · 2019年10月9日

【哈佛大学商学院课程Fall 2019】机器学习可解释性

【哈佛大学商学院课程Fall 2019】机器学习可解释性

专知会员服务

105+阅读 · 2019年10月9日

热门VIP内容

开通专知VIP会员享更多权益服务

隐身自主无人水下航行器技术如何变革水下作战并重塑海军竞争

《俄乌战争中的无人系统：新的战争方式与新兴趋势——来自前线的印象》报告

《海上自主水面船舶远程操作中心：安全可持续运行的多维度分析》

相关资讯

GNN 新基准！Long Range Graph Benchmark

GNN 新基准！Long Range Graph Benchmark

图与推荐

0+阅读 · 2022年10月18日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

强化学习的Unsupervised Meta-Learning

强化学习的Unsupervised Meta-Learning

CreateAMind

18+阅读 · 2019年1月7日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

【代码资源】GAN | 七份最热GAN文章及代码分享（Github 1000+Stars）

【代码资源】GAN | 七份最热GAN文章及代码分享（Github 1000+Stars）

专知

13+阅读 · 2018年6月24日

【论文推荐】最新八篇生成对抗网络相关论文—BRE、图像合成、多模态图像生成、非配对多域图、注意力、对抗特征增强、深度对抗性训练

【论文推荐】最新八篇生成对抗网络相关论文—BRE、图像合成、多模态图像生成、非配对多域图、注意力、对抗特征增强、深度对抗性训练

专知

16+阅读 · 2018年5月14日

【论文推荐】最新5篇图像描述生成（Image Caption）相关论文—情感、注意力机制、遥感图像、序列到序列、深度神经结构

【论文推荐】最新5篇图像描述生成（Image Caption）相关论文—情感、注意力机制、遥感图像、序列到序列、深度神经结构

专知

66+阅读 · 2018年1月31日

【论文】变分推断（Variational inference)的总结

【论文】变分推断（Variational inference)的总结

机器学习研究会

39+阅读 · 2017年11月16日

相关论文

Evaluating Open-Domain Question Answering in the Era of Large Language Models

Arxiv

0+阅读 · 2023年7月6日

KoLA: Carefully Benchmarking World Knowledge of Large Language Models

Arxiv

0+阅读 · 2023年7月6日

A Survey on Evaluation of Large Language Models

Arxiv

0+阅读 · 2023年7月6日

Style Over Substance: Evaluation Biases for Large Language Models

Arxiv

0+阅读 · 2023年7月6日

A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets

Arxiv

0+阅读 · 2023年7月5日

Embodied Task Planning with Large Language Models

Arxiv

1+阅读 · 2023年7月4日

CARE-MI: Chinese Benchmark for Misinformation Evaluation in Maternity and Infant Care

Arxiv

0+阅读 · 2023年7月4日

A Critical Re-evaluation of Benchmark Datasets for (Deep) Learning-Based Matching Algorithms

Arxiv

0+阅读 · 2023年7月3日

Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey

Arxiv

25+阅读 · 2023年2月20日

DOTA: A Large-scale Dataset for Object Detection in Aerial Images

Arxiv

19+阅读 · 2018年1月27日

相关基金

p53基因突变促进Wilms 肿瘤发展转移的小鼠动物模型研究

国家自然科学基金

0+阅读 · 2014年12月31日

Delta-Sarcoglycan基因的两个新突变在东亚人遗传性心肌病中的致病作用及其机理

国家自然科学基金

0+阅读 · 2014年12月31日

常染色体隐性遗传小脑性共济失调新的致病基因CAX的功能研究

国家自然科学基金

0+阅读 · 2014年12月31日

近相变温度下钛表面融盐电解法渗硼的渗层强化生长机制

国家自然科学基金

0+阅读 · 2013年12月31日

p75NTR基因rs2072446多态性所致错义突变在阿尔茨海默病发生中的作用及机制研究

国家自然科学基金

0+阅读 · 2013年12月31日

大行程、高分辨率直线型超声电机及其驱动的精密运动平台的研究

国家自然科学基金

0+阅读 · 2012年12月31日

胆固醇酯转运蛋白在息肉状脉络膜血管病变发病中的作用机制研究

国家自然科学基金

0+阅读 · 2012年12月31日

BRR2蛋白突变导致视网膜色素变性发病机制的研究

国家自然科学基金

0+阅读 · 2011年12月31日

维生素D及其受体基因多态性与高血压发病的研究

国家自然科学基金

0+阅读 · 2009年12月31日

sRAGE对缺血/再灌注的心脏保护作用及其机制

国家自然科学基金

0+阅读 · 2008年12月31日

微信扫码咨询专知VIP会员