State-of-the-art summarization systems are trained and evaluated on massive datasets scraped from the web. Despite their prevalence, we know very little about the underlying characteristics of these datasets (data noise, summarization complexity, etc.), and how they affect system performance and the reliability of automatic metrics like ROUGE. In this study, we manually analyze 600 samples from three popular summarization datasets. Our study is driven by a six-class typology that captures different noise types (missing facts, entities) and degrees of summarization difficulty (extractive, abstractive). We follow with a thorough analysis of 27 state-of-the-art summarization models and 5 popular metrics, and report our key insights: (1) Datasets have distinct data quality and complexity distributions, which can be traced back to their collection process. (2) The performance of models and the reliability of metrics depend on sample complexity. (3) Faithful summaries often receive low scores because of the poor diversity of references. We release the code, annotated data, and model outputs.
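To make insight (3) concrete, the following is a minimal, hypothetical sketch using Google's rouge-score package; the example sentences and package choice are our own illustration under stated assumptions, not the paper's evaluation setup.

```python
# Illustrative only: scoring a faithful but abstractive summary against a
# single reference with Google's rouge-score package (pip install rouge-score).
from rouge_score import rouge_scorer

reference = "The president announced new climate targets on Monday."
# A faithful paraphrase that shares little surface vocabulary with the reference.
candidate = "On Monday, the head of state unveiled fresh goals for cutting emissions."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)

for metric, result in scores.items():
    print(f"{metric}: precision={result.precision:.2f} "
          f"recall={result.recall:.2f} f1={result.fmeasure:.2f}")
# Despite being faithful, the paraphrase gets little n-gram overlap credit,
# illustrating how a single reference can penalize valid abstractive output.
```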