Recently, there has been growing interest in building text generation systems from a discourse coherence perspective, e.g., modeling the interdependence between sentences. Still, recent BERT-based evaluation metrics cannot recognize coherence and fail to punish incoherent elements in system outputs. In this work, we introduce DiscoScore, a discourse metric with multiple variants, which uses BERT to model discourse coherence from different perspectives, driven by Centering theory. Our experiments encompass 16 non-discourse and discourse metrics, including DiscoScore and popular coherence models, evaluated on summarization and document-level machine translation (MT). We find that (i) the majority of BERT-based metrics correlate much worse with human-rated coherence than early discourse metrics, invented a decade ago; (ii) the recent state-of-the-art BARTScore is weak when operating at the system level -- which is particularly problematic, as systems are typically compared in this manner. DiscoScore, in contrast, achieves strong system-level correlation with human ratings, not only in coherence but also in factual consistency and other aspects, and surpasses BARTScore by over 10 correlation points on average. Further, aiming to understand DiscoScore, we provide justification for the importance of discourse coherence for evaluation metrics, and explain the superiority of one variant over another. Our code is available at \url{https://github.com/AIPHES/DiscoScore}.
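As a point of clarification on the evaluation setup: a minimal, hypothetical sketch (not code from the paper) of how system-level correlation is typically computed. A metric's scores are averaged per system, human ratings are averaged per system, and the two vectors of system means are then correlated. All data and names below are illustrative assumptions.

```python
def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def system_level_correlation(metric_scores, human_scores):
    """Correlate per-system mean metric scores with per-system
    mean human ratings (system-level, as opposed to segment-level,
    which would correlate scores per individual output)."""
    systems = sorted(metric_scores)
    m = [sum(metric_scores[s]) / len(metric_scores[s]) for s in systems]
    h = [sum(human_scores[s]) / len(human_scores[s]) for s in systems]
    return pearson(m, h)

# Toy illustration with three hypothetical systems:
metric = {"sysA": [0.2, 0.4], "sysB": [0.6, 0.8], "sysC": [0.1, 0.3]}
human = {"sysA": [3, 4], "sysB": [7, 8], "sysC": [1, 2]}
r = system_level_correlation(metric, human)
```

A metric can look reasonable segment by segment yet rank whole systems poorly, which is why the abstract treats weak system-level correlation as especially problematic.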