A Comprehensive Assessment of Dialog Evaluation Metrics - 专知 Papers

July 2, 2021

A Comprehensive Assessment of Dialog Evaluation Metrics

Yi-Ting Yeh, Maxine Eskenazi, Shikib Mehri

Automatic evaluation metrics are a crucial component of dialog systems research. Standard language evaluation metrics are known to be ineffective for evaluating dialog. As such, recent research has proposed a number of novel, dialog-specific metrics that correlate better with human judgements. Due to the fast pace of research, many of these metrics have been assessed on different datasets and there has as yet been no time for a systematic comparison between them. To this end, this paper provides a comprehensive assessment of recently proposed dialog evaluation metrics on a number of datasets. In this paper, 23 different automatic evaluation metrics are evaluated on 10 different datasets. Furthermore, the metrics are assessed in different settings, to better qualify their respective strengths and weaknesses. Metrics are assessed (1) on both the turn level and the dialog level, (2) for different dialog lengths, (3) for different dialog qualities (e.g., coherence, engaging), (4) for different types of response generation models (i.e., generative, retrieval, simple models and state-of-the-art models), (5) taking into account the similarity of different metrics and (6) exploring combinations of different metrics. This comprehensive assessment offers several takeaways pertaining to dialog evaluation metrics in general. It also suggests how to best assess evaluation metrics and indicates promising directions for future work.
