How to Evaluate a Summarizer: Study Design and Statistical Analysis for Manual Linguistic Quality Evaluation

January 27, 2021

Julius Steen, Katja Markert

From arXiv; accepted at EACL 2021

Manual evaluation is essential to judge progress on automatic text summarization. However, we conduct a survey on recent summarization system papers that reveals little agreement on how to perform such evaluation studies. We conduct two evaluation experiments on two aspects of summaries' linguistic quality (coherence and repetitiveness) to compare Likert-type and ranking annotations and show that the best choice of evaluation method can vary from one aspect to another. In our survey, we also find that study parameters such as the overall number of annotators and the distribution of annotators to annotation items are often not fully reported and that subsequent statistical analysis ignores grouping factors arising from one annotator judging multiple summaries. Using our evaluation experiments, we show that the total number of annotators can have a strong impact on study power and that current statistical analysis methods can inflate type I error rates up to eight-fold. In addition, we highlight that, for the purpose of system comparison, the current practice of eliciting multiple judgements per summary leads to less powerful and reliable annotations given a fixed study budget.
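The point about grouping factors can be illustrated with a toy simulation. The sketch below is not the authors' experimental setup or analysis. It assumes a hypothetical study in which each annotator scores the difference between two systems on several items, and each annotator has a personal preference offset even though the systems are truly equal. All names, parameter values, and the choice of a one-sample t-test are assumptions made for illustration. It compares a test that treats every judgement as independent with a test on per-annotator means.

```python
import random
import statistics

# Two-sided 5% critical values of Student's t distribution.
# They match the default design below only (5 annotators x 20 items).
T_CRIT_DF99 = 1.984  # 99 degrees of freedom: all 100 judgements pooled
T_CRIT_DF4 = 2.776   # 4 degrees of freedom: 5 annotator means


def one_sample_t(values):
    """t statistic for the null hypothesis that the mean of `values` is 0."""
    mean = statistics.fmean(values)
    standard_error = statistics.stdev(values) / len(values) ** 0.5
    return mean / standard_error


def simulate(n_sims=2000, n_annotators=5, items_per_annotator=20,
             annotator_sd=0.5, seed=0):
    """Estimate type I error rates of two analyses when the systems are equal.

    Each simulated judgement is a score difference between two systems:
    an annotator-specific offset (hypothetical, drawn once per annotator)
    plus independent item-level noise. The true mean difference is 0,
    so every rejection is a false positive.
    """
    rng = random.Random(seed)
    naive_rejections = 0
    grouped_rejections = 0
    for _ in range(n_sims):
        all_diffs = []
        annotator_means = []
        for _ in range(n_annotators):
            offset = rng.gauss(0.0, annotator_sd)
            diffs = [offset + rng.gauss(0.0, 1.0)
                     for _ in range(items_per_annotator)]
            all_diffs.extend(diffs)
            annotator_means.append(statistics.fmean(diffs))
        # Naive analysis: pool all judgements and ignore who made them.
        if abs(one_sample_t(all_diffs)) > T_CRIT_DF99:
            naive_rejections += 1
        # Grouped analysis: one value per annotator, so the annotator
        # is the unit of analysis.
        if abs(one_sample_t(annotator_means)) > T_CRIT_DF4:
            grouped_rejections += 1
    return naive_rejections / n_sims, grouped_rejections / n_sims


if __name__ == "__main__":
    naive_rate, grouped_rate = simulate()
    print(f"naive type I error rate:   {naive_rate:.3f}")
    print(f"grouped type I error rate: {grouped_rate:.3f}")
```

Under these assumptions, the pooled test should reject far more often than the nominal 5% level, while the per-annotator test should stay close to it. The size of the gap depends entirely on the assumed `annotator_sd` and design, so it says nothing about the eight-fold figure reported in the abstract.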

