LLMMaps -- A Visual Metaphor for Stratified Evaluation of Large Language Models

Stratification · Large language models · Language models · Datasets · Knowledge

April 2, 2023


Patrik Puchert, Poonam Poonam, Christian van Onzenoodt, Timo Ropinski

Large Language Models (LLMs) have revolutionized natural language processing and demonstrated impressive capabilities in various tasks. Unfortunately, they are prone to hallucinations, where the model presents incorrect or false information in its responses, which makes diligent evaluation approaches mandatory. While LLM performance in specific knowledge fields is often evaluated based on question and answer (Q&A) datasets, such evaluations usually report only a single accuracy number for the entire field, a procedure that is problematic with respect to transparency and model improvement. A stratified evaluation could instead reveal subfields where hallucinations are more likely to occur and thus help to better assess LLMs' risks and guide their further development. To support such stratified evaluations, we propose LLMMaps as a novel visualization technique that enables users to evaluate LLMs' performance with respect to Q&A datasets. LLMMaps provide detailed insights into LLMs' knowledge capabilities in different subfields by transforming Q&A datasets as well as LLM responses into our internal knowledge structure. Furthermore, an extension for comparative visualization allows for the detailed comparison of multiple LLMs. To assess LLMMaps, we use them to conduct a comparative analysis of several state-of-the-art LLMs, such as BLOOM, GPT-2, GPT-3, ChatGPT and LLaMa-13B, and we carry out two qualitative user evaluations. All necessary source code and data for generating LLMMaps to be used in scientific publications and elsewhere will be available on GitHub.
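
The contrast the abstract draws between a single accuracy number and a stratified evaluation can be illustrated with a minimal Python sketch. This is not the LLMMaps implementation: the function name `stratified_accuracy`, the `(subfield, is_correct)` record layout, the subfield labels, and the grading results are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def stratified_accuracy(records):
    """Return (overall_accuracy, {subfield: accuracy}) for graded Q&A records.

    Each record is a (subfield, is_correct) pair. This layout is an
    assumption made for the sketch, not the LLMMaps knowledge structure.
    """
    if not records:
        raise ValueError("records must not be empty")
    totals = defaultdict(int)   # questions asked per subfield
    correct = defaultdict(int)  # questions answered correctly per subfield
    for subfield, is_correct in records:
        totals[subfield] += 1
        correct[subfield] += int(is_correct)
    overall = sum(correct.values()) / sum(totals.values())
    per_subfield = {name: correct[name] / totals[name] for name in totals}
    return overall, per_subfield


# Made-up grading results: 4 of 4 correct in one subfield, 1 of 4 in another.
records = [
    ("anatomy", True), ("anatomy", True), ("anatomy", True), ("anatomy", True),
    ("pharmacology", True), ("pharmacology", False),
    ("pharmacology", False), ("pharmacology", False),
]
overall, per_subfield = stratified_accuracy(records)
print(overall)       # 0.625
print(per_subfield)  # {'anatomy': 1.0, 'pharmacology': 0.25}
```

In this toy data the overall accuracy of 0.625 hides the fact that one subfield is answered perfectly while the other is mostly wrong, which is the kind of gap a stratified view is meant to expose.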
