By harnessing pre-trained language models, summarization models have made rapid progress recently. However, these models are mainly assessed by automatic evaluation metrics such as ROUGE. Although ROUGE is known to correlate positively with human evaluation scores, it has been criticized for its vulnerability and for the gap between its scores and actual summary quality. In this paper, we compare summaries generated by a recent pre-trained language model, BART, with the reference summaries from a benchmark dataset, CNN/DM, using a crowd-sourced human evaluation metric. Interestingly, the model-generated summaries receive higher scores than the reference summaries. Stemming from our experimental results, we first discuss the intrinsic characteristics of the CNN/DM dataset, the progress of pre-trained language models, and their ability to generalize on the training data. Finally, we share our insights into the model-generated summaries and present our thoughts on learning methods for abstractive summarization.