A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods - Zhuanzhi Papers

Correlation coefficient · Statistic · Confidence · ROUGE · Estimation/Estimator

July 26, 2021

A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods

Daniel Deutsch,Rotem Dror,Dan Roth

From arXiv. This is a pre-MIT Press publication version of the paper.

The quality of a summarization evaluation metric is quantified by calculating the correlation between its scores and human annotations across a large number of summaries. Currently, it is unclear how precise these correlation estimates are, or whether differences between two metrics' correlations reflect a true difference or are due to mere chance. In this work, we address these two problems by proposing methods for calculating confidence intervals and running hypothesis tests for correlations using two resampling methods, bootstrapping and permutation. After evaluating which of the proposed methods is most appropriate for summarization through two simulation experiments, we analyze the results of applying these methods to several different automatic evaluation metrics across three sets of human annotations. We find that the confidence intervals are rather wide, demonstrating high uncertainty in the reliability of automatic metrics. Further, although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
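To make the two resampling ideas named in the abstract concrete, here is a minimal Python sketch. It is not the authors' procedure: the abstract does not say how summaries are resampled, so this version assumes a flat list of per-summary scores, uses a percentile bootstrap for the confidence interval, and uses a paired swap permutation test on z-scored metric values for the comparison. The function names (`bootstrap_ci`, `permutation_test`) and the synthetic data are hypothetical.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


def bootstrap_ci(metric, human, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for corr(metric, human)."""
    rng = np.random.default_rng(seed)
    metric = np.asarray(metric, dtype=float)
    human = np.asarray(human, dtype=float)
    n = len(metric)
    stats = np.empty(n_resamples)
    for b in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # resample summaries with replacement
        stats[b] = pearson(metric[idx], human[idx])
    lower, upper = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)


def permutation_test(metric_a, metric_b, human, n_permutations=2000, seed=0):
    """One-sided paired permutation test of H1: corr(A, human) > corr(B, human)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(metric_a, dtype=float)
    b = np.asarray(metric_b, dtype=float)
    human = np.asarray(human, dtype=float)
    # Assumption: z-score both metrics so swapped scores share a common scale.
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    observed = pearson(a, human) - pearson(b, human)
    count = 0
    for _ in range(n_permutations):
        swap = rng.random(len(a)) < 0.5  # swap each summary's scores with prob. 0.5
        a_perm = np.where(swap, b, a)
        b_perm = np.where(swap, a, b)
        if pearson(a_perm, human) - pearson(b_perm, human) >= observed:
            count += 1
    return (count + 1) / (n_permutations + 1)  # add-one smoothed p-value


# Synthetic illustration only: metric A tracks the human scores more closely than B.
data_rng = np.random.default_rng(42)
human = data_rng.normal(size=200)
metric_a = human + data_rng.normal(scale=0.5, size=200)
metric_b = human + data_rng.normal(scale=2.0, size=200)

low, high = bootstrap_ci(metric_a, human)
p_value = permutation_test(metric_a, metric_b, human)
print(f"95% CI for metric A: [{low:.3f}, {high:.3f}]")
print(f"p-value (A better than B): {p_value:.4f}")
```

A wide interval from `bootstrap_ci` signals that the correlation estimate is imprecise, and a small `p_value` suggests the gap between the two metrics is unlikely to be due to chance alone. Both readings depend on the resampling assumptions stated above.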
