Human evaluation for summarization tasks is reliable but brings in issues of reproducibility and high cost. Automatic metrics are cheap and reproducible but sometimes correlate poorly with human judgment. In this work, we propose flexible semi-automatic to automatic summary evaluation metrics, following the Pyramid human evaluation method. The semi-automatic Lite2Pyramid retains the reusable human-labeled Summary Content Units (SCUs) for the reference(s) but replaces the manual work of judging SCUs' presence in system summaries with a natural language inference (NLI) model. The fully automatic Lite3Pyramid further substitutes SCUs with Semantic Triplet Units (STUs) extracted automatically via a semantic role labeling (SRL) model. Finally, we propose in-between metrics, Lite2.xPyramid, where a simple regressor predicts how well each STU can simulate its SCU, and we retain only the SCUs that are harder to simulate, providing a smooth transition and balance between automation and manual evaluation. Comparing against 15 existing metrics, we evaluate human-metric correlations on 3 existing meta-evaluation datasets and our newly collected PyrXSum (with 100/10 XSum examples/systems). The results show that Lite2Pyramid consistently has the best summary-level correlations; Lite3Pyramid performs better than or comparably to other automatic metrics; and Lite2.xPyramid trades small correlation drops for larger reductions in manual effort, which can lower the cost of future data collection. Our code and data are publicly available at: https://github.com/ZhangShiyue/Lite2-3Pyramid
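To make the SCU-presence step concrete, below is a minimal sketch of how an off-the-shelf NLI model can stand in for the human judgment in Lite2Pyramid. It assumes the public `roberta-large-mnli` checkpoint and a simple average of entailment probabilities over SCUs; the exact model and aggregation used in the paper may differ (see the repository above).

```python
# Minimal sketch of NLI-based SCU presence judgment (Lite2Pyramid-style).
# Assumptions (not taken verbatim from the paper): the public
# "roberta-large-mnli" checkpoint, scoring each SCU against the whole
# system summary, and averaging entailment probabilities over SCUs.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")
model.eval()

def lite2pyramid_score(system_summary: str, scus: list[str]) -> float:
    """Average NLI entailment probability of each SCU given the summary."""
    probs = []
    for scu in scus:
        # Premise = system summary, hypothesis = SCU.
        inputs = tokenizer(system_summary, scu,
                           return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        # roberta-large-mnli label order: 0=contradiction, 1=neutral, 2=entailment
        probs.append(torch.softmax(logits, dim=-1)[0, 2].item())
    return sum(probs) / max(len(probs), 1)
```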
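Similarly, the following is a hedged sketch of the STU extraction behind Lite3Pyramid, using AllenNLP's BERT-based SRL tagger. Approximating an STU as an (ARG0, predicate, ARG1) triplet is an illustrative simplification; the paper's extraction rules are more involved.

```python
# Hedged sketch of SRL-based STU extraction (Lite3Pyramid's automatic units).
# Assumption: an STU is approximated as an (ARG0, predicate, ARG1) triplet
# read off AllenNLP's public SRL model. Requires: pip install allennlp allennlp-models
from allennlp.predictors.predictor import Predictor

SRL_MODEL = ("https://storage.googleapis.com/allennlp-public-models/"
             "structured-prediction-srl-bert.2020.12.15.tar.gz")
predictor = Predictor.from_path(SRL_MODEL)

def extract_stus(sentence: str) -> list[str]:
    """Turn each predicate-argument frame into a short declarative unit."""
    out = predictor.predict(sentence=sentence)
    words = out["words"]
    stus = []
    for frame in out["verbs"]:
        # Collect the words of each labeled span from the BIO tags,
        # e.g. "B-ARG0" / "I-ARG0" -> role "ARG0".
        spans: dict[str, list[str]] = {}
        for word, tag in zip(words, frame["tags"]):
            if tag == "O":
                continue
            role = tag.split("-", 1)[1]
            spans.setdefault(role, []).append(word)
        if "ARG0" in spans and "ARG1" in spans:
            stus.append(" ".join(spans["ARG0"] + [frame["verb"]] + spans["ARG1"]))
    return stus
```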