Recently, there has been a growing interest in designing text generation systems from a discourse coherence perspective, e.g., modeling the interdependence between sentences. However, recent BERT-based evaluation metrics are weak at recognizing coherence, and thus cannot reliably detect the discourse-level improvements of those text generation systems. In this work, we introduce DiscoScore, a parametrized discourse metric, which uses BERT to model discourse coherence from different perspectives, driven by Centering theory. Our experiments encompass 16 non-discourse and discourse metrics, including DiscoScore and popular coherence models, evaluated on summarization and document-level machine translation (MT). We find that (i) the majority of BERT-based metrics correlate much worse with human-rated coherence than early discourse metrics invented a decade ago; (ii) the recent state-of-the-art BARTScore is weak when operated at system level -- which is particularly problematic as systems are typically compared in this manner. DiscoScore, in contrast, achieves strong system-level correlation with human ratings, not only in coherence but also in factual consistency and other aspects, and surpasses BARTScore by over 10 correlation points on average. Further, aiming to understand DiscoScore, we justify the importance of discourse coherence for evaluation metrics and explain the superiority of one variant over another. Our code is available at \url{https://github.com/AIPHES/DiscoScore}.
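To give a concrete sense of what a BERT-based coherence signal looks like, below is a minimal, hypothetical sketch: it scores a document by the average cosine similarity of adjacent sentence embeddings obtained from vanilla BERT. This is only a crude proxy for illustration, not the paper's Centering-driven formulation (the released variants, e.g., DS-Focus and DS-SENT, live in the linked repository); it assumes the \texttt{torch} and \texttt{transformers} packages are installed.

\begin{verbatim}
# Hypothetical illustration only -- NOT the actual DiscoScore metric.
# DiscoScore models Centering-style focus transitions; here we merely
# average adjacent-sentence cosine similarity as a toy coherence proxy.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(sentence):
    """Mean-pooled BERT token embeddings as a sentence vector."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)

def coherence_proxy(sentences):
    """Average cosine similarity between adjacent sentences."""
    if len(sentences) < 2:
        return 1.0
    vecs = [embed(s) for s in sentences]
    sims = [torch.cosine_similarity(a, b, dim=0).item()
            for a, b in zip(vecs, vecs[1:])]
    return sum(sims) / len(sims)

# A coherent pair followed by an off-topic sentence should lower the score.
print(coherence_proxy([
    "The cat sat on the mat.",
    "It purred contentedly.",
    "Stock prices fell sharply on Tuesday.",
]))
\end{verbatim}

Note that such a similarity-based proxy rewards lexical and semantic overlap rather than entity-transition structure; the paper's point is precisely that modeling Centering-style focus transitions captures coherence better than surface BERT similarity.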