MENLI: 基于自然语言推理的鲁棒性评价指标 (MENLI: Robust Evaluation Metrics from Natural Language Inference) - 专知论文

会员服务 ·

0

评价指标 · 标准基 · 自然语言推理 · 鲁棒 · 评价 ·

2023 年 4 月 3 日

MENLI: Robust Evaluation Metrics from Natural Language Inference

翻译：MENLI: 基于自然语言推理的鲁棒性评价指标

Yanran Chen,Steffen Eger

from arxiv, TACL 2023 camera-ready

Recently proposed BERT-based evaluation metrics for text generation perform well on standard benchmarks but are vulnerable to adversarial attacks, e.g., relating to information correctness. We argue that this stems (in part) from the fact that they are models of semantic similarity. In contrast, we develop evaluation metrics based on Natural Language Inference (NLI), which we deem a more appropriate modeling. We design a preference-based adversarial attack framework and show that our NLI based metrics are much more robust to the attacks than the recent BERT-based metrics. On standard benchmarks, our NLI based metrics outperform existing summarization metrics, but perform below SOTA MT metrics. However, when combining existing metrics with our NLI metrics, we obtain both higher adversarial robustness (15%-30%) and higher quality metrics as measured on standard benchmarks (+5% to 30%).

翻译：最近，针对文本生成提出的基于BERT的评价指标在标准基准上表现良好，但容易受到对信息正确性的对抗攻击。我们认为这部分原因在于它们是语义相似性模型。相比之下，我们开发了一种基于自然语言推理(NLI)的评价指标，认为这是一种更合适的模型。我们设计了一种基于偏好的对抗攻击框架，并展示了我们基于NLI的指标比最近的基于BERT的指标更具鲁棒性。在标准基准上，我们的NLI指标优于现有的摘要指标，但低于SOTA MT指标。然而，将现有的指标与我们的NLI指标相结合，我们获得了更高的对抗鲁棒性(15%-30%)和更高的标准基准质量指标(+5%至30%)。

0

相关内容

评价指标

AAAI 2022 | 基于预训练-微调框架的图像差异描述任务

AAAI 2022 | 基于预训练-微调框架的图像差异描述任务

专知会员服务

18+阅读 · 2022年2月26日

【AAAI2021】LRC-BERT：对比学习潜在语义知识蒸馏的自然语言理解

专知会员服务

27+阅读 · 2020年12月31日

【ICML2020】文本摘要生成模型PEGASUS

【ICML2020】文本摘要生成模型PEGASUS

专知会员服务

35+阅读 · 2020年8月23日

【视频描述综述论文】Video Description: A Survey of Methods, Datasets, and Evaluation Metrics

【视频描述综述论文】Video Description: A Survey of Methods, Datasets, and Evaluation Metrics

专知会员服务

65+阅读 · 2020年5月12日

【ACL2020-Google】学习鲁棒度量的文本生成，BLEURT: Learning Robust Metrics for Text Generation

【ACL2020-Google】学习鲁棒度量的文本生成，BLEURT: Learning Robust Metrics for Text Generation

专知会员服务

17+阅读 · 2020年4月10日

【AAAI2020】Context-Transformer:上下文转换器:解决对象混淆的小样本检测，Context-Transformer: Tackling Object Confusion for Few-Shot Detection

【AAAI2020】Context-Transformer:上下文转换器:解决对象混淆的小样本检测，Context-Transformer: Tackling Object Confusion for Few-Shot Detection

专知会员服务

51+阅读 · 2020年3月17日

【厦门大学-CVPR2020】协调可迁移性与可判别性的自适应目标检测器，Adapting Object Detectors

【厦门大学-CVPR2020】协调可迁移性与可判别性的自适应目标检测器，Adapting Object Detectors

专知会员服务

26+阅读 · 2020年3月16日

【微软雷德蒙研究院】小样本自然语言生成，Few-shot Natural Language Generation for Task-Oriented Dialog

【微软雷德蒙研究院】小样本自然语言生成，Few-shot Natural Language Generation for Task-Oriented Dialog

专知会员服务

33+阅读 · 2020年2月29日

【AAAI2020接受论文】预测性参与:开放领域对话系统自动评估的有效指标（Predictive Engagement: An Efficient Metric For Automatic Evaluation of Open-Domain Dialogue Systems）

【AAAI2020接受论文】预测性参与:开放领域对话系统自动评估的有效指标（Predictive Engagement: An Efficient Metric For Automatic Evaluation of Open-Domain Dialogue Systems）

专知会员服务

14+阅读 · 2019年11月15日

【中科院计算所 | 文献综述】自然语言生成的无监督前训练:文献综述，Unsupervised Pre-training for Natural Language Generation: A Literature Review

【中科院计算所 | 文献综述】自然语言生成的无监督前训练:文献综述，Unsupervised Pre-training for Natural Language Generation: A Literature Review

专知会员服务

48+阅读 · 2019年11月15日

GNN 新基准！Long Range Graph Benchmark

GNN 新基准！Long Range Graph Benchmark

图与推荐

0+阅读 · 2022年10月18日

RoBERTa中文预训练模型：RoBERTa for Chinese

RoBERTa中文预训练模型：RoBERTa for Chinese

PaperWeekly

57+阅读 · 2019年9月16日

RoBERTa for Chinese：大规模中文预训练RoBERTa模型

RoBERTa for Chinese：大规模中文预训练RoBERTa模型

AINLP

30+阅读 · 2019年9月8日

CVPR2019 | 15篇论文速递（涵盖目标检测、语义分割和姿态估计等方向）

CVPR2019 | 15篇论文速递（涵盖目标检测、语义分割和姿态估计等方向）

AI研习社

15+阅读 · 2019年5月8日

无监督元学习表示学习

无监督元学习表示学习

CreateAMind

27+阅读 · 2019年1月4日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

【论文推荐】最新6篇视觉问答（VQA）相关论文—目标推理、深度循环模型、可解释性、数据可视化、Triplet学习、基准

【论文推荐】最新6篇视觉问答（VQA）相关论文—目标推理、深度循环模型、可解释性、数据可视化、Triplet学习、基准

专知

15+阅读 · 2018年2月3日

论文笔记 | How NOT To Evaluate Your Dialogue System

论文笔记 | How NOT To Evaluate Your Dialogue System

科技创新与创业

13+阅读 · 2017年12月23日

ResNet, AlexNet, VGG, Inception：各种卷积网络架构的理解

ResNet, AlexNet, VGG, Inception：各种卷积网络架构的理解

全球人工智能

20+阅读 · 2017年12月17日

【推荐】ResNet, AlexNet, VGG, Inception：各种卷积网络架构的理解

【推荐】ResNet, AlexNet, VGG, Inception：各种卷积网络架构的理解

机器学习研究会

20+阅读 · 2017年12月17日

基于安全需求分析的内核保护方法研究

国家自然科学基金

2+阅读 · 2015年12月31日

基于大数据的消费金融信用风险计量与管理

国家自然科学基金

3+阅读 · 2014年12月31日

从血浆exosomes中筛选和鉴定预测EBV抗体阳性的体检人群患鼻咽癌风险的指标

国家自然科学基金

0+阅读 · 2014年12月31日

基于词向量表示的大规模知识图谱构建方法研究

国家自然科学基金

8+阅读 · 2014年12月31日

无监督分词及词性归纳联合方法研究

国家自然科学基金

1+阅读 · 2013年12月31日

云计算演化环境中的隐私建模与检测方法研究

国家自然科学基金

0+阅读 · 2012年12月31日

P53蛋白调节mTOR信号通路诱导胰腺癌吉西他滨耐药的机制研究

国家自然科学基金

0+阅读 · 2012年12月31日

Reality-based Interaction用户界面模型和评估方法研究

国家自然科学基金

0+阅读 · 2011年12月31日

基于高阶统计量和ARMA模型的高分辨率地震子波提取技术研究

国家自然科学基金

0+阅读 · 2009年12月31日

基于配价结构和话题结构的汉语句法分析和语义计算模型研究

国家自然科学基金

0+阅读 · 2009年12月31日

Are Large Language Models Good Evaluators for Abstractive Summarization?

Arxiv

0+阅读 · 2023年5月22日

GLUE-X: Evaluating Natural Language Understanding Models from an Out-of-distribution Generalization Perspective

Arxiv

0+阅读 · 2023年5月22日

Improving Metrics for Speech Translation

Arxiv

0+阅读 · 2023年5月22日

The Inside Story: Towards Better Understanding of Machine Translation Neural Evaluation Metrics

The Inside Story: Towards Better Understanding of Machine Translation Neural Evaluation Metrics

Arxiv

0+阅读 · 2023年5月19日

Controlling the Extraction of Memorized Data from Large Language Models via Prompt-Tuning

Arxiv

0+阅读 · 2023年5月19日

TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models

Arxiv

0+阅读 · 2023年5月18日

CLEME: Debiasing Multi-reference Evaluation for Grammatical Error Correction

Arxiv

0+阅读 · 2023年5月18日

Flatness-Aware Prompt Selection Improves Accuracy and Sample Efficiency

Arxiv

0+阅读 · 2023年5月18日

A Survey of Explainable Graph Neural Networks: Taxonomy and Evaluation Metrics

Arxiv

14+阅读 · 2022年7月26日

An Overview on Machine Translation Evaluation

An Overview on Machine Translation Evaluation

Arxiv

14+阅读 · 2022年2月22日

VIP会员

文章信息

相关主题

自然语言推理

相关VIP内容

AAAI 2022 | 基于预训练-微调框架的图像差异描述任务

AAAI 2022 | 基于预训练-微调框架的图像差异描述任务

专知会员服务

18+阅读 · 2022年2月26日

【AAAI2021】LRC-BERT：对比学习潜在语义知识蒸馏的自然语言理解

专知会员服务

27+阅读 · 2020年12月31日

【ICML2020】文本摘要生成模型PEGASUS

【ICML2020】文本摘要生成模型PEGASUS

专知会员服务

35+阅读 · 2020年8月23日

【视频描述综述论文】Video Description: A Survey of Methods, Datasets, and Evaluation Metrics

【视频描述综述论文】Video Description: A Survey of Methods, Datasets, and Evaluation Metrics

专知会员服务

65+阅读 · 2020年5月12日

【ACL2020-Google】学习鲁棒度量的文本生成，BLEURT: Learning Robust Metrics for Text Generation

【ACL2020-Google】学习鲁棒度量的文本生成，BLEURT: Learning Robust Metrics for Text Generation

专知会员服务

17+阅读 · 2020年4月10日

【AAAI2020】Context-Transformer:上下文转换器:解决对象混淆的小样本检测，Context-Transformer: Tackling Object Confusion for Few-Shot Detection

【AAAI2020】Context-Transformer:上下文转换器:解决对象混淆的小样本检测，Context-Transformer: Tackling Object Confusion for Few-Shot Detection

专知会员服务

51+阅读 · 2020年3月17日

【厦门大学-CVPR2020】协调可迁移性与可判别性的自适应目标检测器，Adapting Object Detectors

【厦门大学-CVPR2020】协调可迁移性与可判别性的自适应目标检测器，Adapting Object Detectors

专知会员服务

26+阅读 · 2020年3月16日

【微软雷德蒙研究院】小样本自然语言生成，Few-shot Natural Language Generation for Task-Oriented Dialog

【微软雷德蒙研究院】小样本自然语言生成，Few-shot Natural Language Generation for Task-Oriented Dialog

专知会员服务

33+阅读 · 2020年2月29日

【AAAI2020接受论文】预测性参与:开放领域对话系统自动评估的有效指标（Predictive Engagement: An Efficient Metric For Automatic Evaluation of Open-Domain Dialogue Systems）

【AAAI2020接受论文】预测性参与:开放领域对话系统自动评估的有效指标（Predictive Engagement: An Efficient Metric For Automatic Evaluation of Open-Domain Dialogue Systems）

专知会员服务

14+阅读 · 2019年11月15日

【中科院计算所 | 文献综述】自然语言生成的无监督前训练:文献综述，Unsupervised Pre-training for Natural Language Generation: A Literature Review

【中科院计算所 | 文献综述】自然语言生成的无监督前训练:文献综述，Unsupervised Pre-training for Natural Language Generation: A Literature Review

专知会员服务

48+阅读 · 2019年11月15日

热门VIP内容

开通专知VIP会员享更多权益服务

《复杂工程系统模型驱动设计决策支持系统：早期设计阶段挑战》最新138页

《日本陆上自卫队2040年作战方式与未来作战研究》最新23页slides

人工智能作为战争武器

《后勤保障》最新23页

相关资讯

GNN 新基准！Long Range Graph Benchmark

GNN 新基准！Long Range Graph Benchmark

图与推荐

0+阅读 · 2022年10月18日

RoBERTa中文预训练模型：RoBERTa for Chinese

RoBERTa中文预训练模型：RoBERTa for Chinese

PaperWeekly

57+阅读 · 2019年9月16日

RoBERTa for Chinese：大规模中文预训练RoBERTa模型

RoBERTa for Chinese：大规模中文预训练RoBERTa模型

AINLP

30+阅读 · 2019年9月8日

CVPR2019 | 15篇论文速递（涵盖目标检测、语义分割和姿态估计等方向）

CVPR2019 | 15篇论文速递（涵盖目标检测、语义分割和姿态估计等方向）

AI研习社

15+阅读 · 2019年5月8日

无监督元学习表示学习

无监督元学习表示学习

CreateAMind

27+阅读 · 2019年1月4日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

【论文推荐】最新6篇视觉问答（VQA）相关论文—目标推理、深度循环模型、可解释性、数据可视化、Triplet学习、基准

【论文推荐】最新6篇视觉问答（VQA）相关论文—目标推理、深度循环模型、可解释性、数据可视化、Triplet学习、基准

专知

15+阅读 · 2018年2月3日

论文笔记 | How NOT To Evaluate Your Dialogue System

论文笔记 | How NOT To Evaluate Your Dialogue System

科技创新与创业

13+阅读 · 2017年12月23日

ResNet, AlexNet, VGG, Inception：各种卷积网络架构的理解

ResNet, AlexNet, VGG, Inception：各种卷积网络架构的理解

全球人工智能

20+阅读 · 2017年12月17日

【推荐】ResNet, AlexNet, VGG, Inception：各种卷积网络架构的理解

【推荐】ResNet, AlexNet, VGG, Inception：各种卷积网络架构的理解

机器学习研究会

20+阅读 · 2017年12月17日

相关论文

Are Large Language Models Good Evaluators for Abstractive Summarization?

Arxiv

0+阅读 · 2023年5月22日

GLUE-X: Evaluating Natural Language Understanding Models from an Out-of-distribution Generalization Perspective

Arxiv

0+阅读 · 2023年5月22日

Improving Metrics for Speech Translation

Arxiv

0+阅读 · 2023年5月22日

The Inside Story: Towards Better Understanding of Machine Translation Neural Evaluation Metrics

The Inside Story: Towards Better Understanding of Machine Translation Neural Evaluation Metrics

Arxiv

0+阅读 · 2023年5月19日

Controlling the Extraction of Memorized Data from Large Language Models via Prompt-Tuning

Arxiv

0+阅读 · 2023年5月19日

TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models

Arxiv

0+阅读 · 2023年5月18日

CLEME: Debiasing Multi-reference Evaluation for Grammatical Error Correction

Arxiv

0+阅读 · 2023年5月18日

Flatness-Aware Prompt Selection Improves Accuracy and Sample Efficiency

Arxiv

0+阅读 · 2023年5月18日

A Survey of Explainable Graph Neural Networks: Taxonomy and Evaluation Metrics

Arxiv

14+阅读 · 2022年7月26日

An Overview on Machine Translation Evaluation

An Overview on Machine Translation Evaluation

Arxiv

14+阅读 · 2022年2月22日

相关基金

基于安全需求分析的内核保护方法研究

国家自然科学基金

2+阅读 · 2015年12月31日

基于大数据的消费金融信用风险计量与管理

国家自然科学基金

3+阅读 · 2014年12月31日

从血浆exosomes中筛选和鉴定预测EBV抗体阳性的体检人群患鼻咽癌风险的指标

国家自然科学基金

0+阅读 · 2014年12月31日

基于词向量表示的大规模知识图谱构建方法研究

国家自然科学基金

8+阅读 · 2014年12月31日

无监督分词及词性归纳联合方法研究

国家自然科学基金

1+阅读 · 2013年12月31日

云计算演化环境中的隐私建模与检测方法研究

国家自然科学基金

0+阅读 · 2012年12月31日

P53蛋白调节mTOR信号通路诱导胰腺癌吉西他滨耐药的机制研究

国家自然科学基金

0+阅读 · 2012年12月31日

Reality-based Interaction用户界面模型和评估方法研究

国家自然科学基金

0+阅读 · 2011年12月31日

基于高阶统计量和ARMA模型的高分辨率地震子波提取技术研究

国家自然科学基金

0+阅读 · 2009年12月31日

基于配价结构和话题结构的汉语句法分析和语义计算模型研究

国家自然科学基金

0+阅读 · 2009年12月31日

微信扫码咨询专知VIP会员